Preface


The field of industrial electronics covers a plethora of problems that must be solved in industrial
practice. Electronic systems control many processes, ranging from relatively simple devices such as
electric motors, through more complicated devices such as robots, to entire fabrication processes.
An industrial electronics engineer deals with many physical phenomena as well as the sensors that are
used to measure them. Thus, the knowledge required by this type of engineer is not only traditional
electronics but also specialized electronics, for example, that required for high-power applications.
The importance of electronic circuits extends well beyond their use as a final product in that they are
also important building blocks in large systems, and thus the industrial electronics engineer must also
possess knowledge of the areas of control and mechatronics. Since most fabrication processes are
relatively complex, there is an inherent requirement for the use of communication systems that not only
link the various elements of the industrial process but are also tailor-made for the specific industrial
environment. Finally, the efficient control and supervision of factories require the application of
intelligent systems in a hierarchical structure to address the needs of all components employed in the
production process. This need is met through the use of intelligent systems such as neural networks,
fuzzy systems, and evolutionary methods. The Industrial Electronics Handbook addresses all these issues
and does so in five books outlined as follows:


1. Fundamentals of Industrial Electronics
2. Power Electronics and Motor Drives
3. Control and Mechatronics
4. Industrial Communication Systems
5. Intelligent Systems


The editors have gone to great lengths to ensure that this handbook is as current and up to date as
possible. Thus, this book closely follows the current research and trends in applications that can be
found in IEEE Transactions on Industrial Electronics. This journal is not only one of the largest
engineering publications of its type in the world, but also one of the most respected. In all technical
categories in which this journal is evaluated, it is ranked either number 1 or number 2 in the world.
As a result, we believe that this handbook, which is written by the world’s leading researchers in the
field, presents the global trends in the ubiquitous area commonly known as industrial electronics.

An interesting phenomenon that has accompanied the progression of our civilization is the systematic
replacement of humans by machines. As far back as 200 years ago, human labor was replaced first by
steam machines and later by electrical machines. Then approximately 20 years ago, clerical and
secretarial jobs were largely replaced by personal computers. Technology has now reached the point where
intelligent systems are replacing human intelligence in decision-making processes as well as aiding in
the solution of very complex problems. In many cases, intelligent systems are already outperforming
human activities. The field of computational intelligence has taken several directions. Artificial neural
networks are not only capable of learning how to classify patterns, for example, images or sequences of
events, but they can also effectively model complex nonlinear systems. The latter ability is probably
the more popular in industrial applications, where there is an inherent need to model nonlinear system
behavior. For example, easily obtainable system parameters can be measured and a neural network used to
evaluate parameters that are difficult to measure but essential for system control. Fuzzy systems have
a similar application. Their main advantage is their simplicity and ease of implementation. Various
aspects of neural networks and fuzzy systems are covered in Parts I and III. Part IV is devoted to
system optimization, where several new techniques including evolutionary methods, swarm, and ant colony
optimizations are covered. Part V is devoted to several applications that deal with methods of
computational intelligence.


For MATLAB® and Simulink® product information, please contact 


The MathWorks, Inc. 

3 Apple Hill Drive 

Natick, MA, 01760-2098 USA 
Tel: 508-647-7000 

Fax: 508-647-7001 

E-mail: info@mathworks.com 
Web: www.mathworks.com 


© 2011 by Taylor and Francis Group, LLC 


Acknowledgments 


The editors wish to express their heartfelt thanks to their wives Barbara Wilamowski and Edie Irwin for
their help and support during the execution of this project.


Editorial Board


Mo-Yuen Chow
North Carolina State University
Raleigh, North Carolina

Josef Korbicz
University of Zielona Gora
Zielona Gora, Poland

Kim Fung Man
City University of Hong Kong
Kowloon, Hong Kong

Milos Manic
University of Idaho, Idaho Falls
Idaho Falls, Idaho

Witold Pedrycz
University of Alberta
Edmonton, Alberta, Canada

Ryszard Tadeusiewicz
AGH University of Science and Technology
Krakow, Poland

Paul J. Werbos
National Science Foundation
Arlington, Virginia

Gary Yen
Oklahoma State University
Stillwater, Oklahoma


Editors 


Bogdan M. Wilamowski received his MS in computer engineering in 1966, his PhD in neural computing in
1970, and Dr. habil. in integrated circuit design in 1977. He received the title of full professor from
the president of Poland in 1987. He was the director of the Institute of Electronics (1979-1981) and
the chair of the solid state electronics department (1987-1989) at the Technical University of Gdansk,
Poland. He was a professor at the University of Wyoming, Laramie, from 1989 to 2000. From 2000 to 2003,
he served as an associate director at the Microelectronics Research and Telecommunication Institute,
University of Idaho, Moscow, and as a professor in the electrical and computer engineering department
and in the computer science department at the same university. Currently, he is the director of the
Alabama Nano/Micro Science and Technology Center (ANMSTC), Auburn, and an alumna professor in the
electrical and computer engineering department at Auburn University, Alabama. Dr. Wilamowski was with
the Communication Institute at Tohoku University, Japan (1968-1970), and spent one year at the
Semiconductor Research Institute, Sendai, Japan, as a JSPS fellow (1975-1976). He was also a visiting
scholar at Auburn University (1981-1982 and 1995-1996) and a visiting professor at the University of
Arizona, Tucson (1982-1984). He is the author of 4 textbooks and more than 300 refereed publications,
and he holds 27 patents. He was the principal professor for about 130 graduate students. His main areas
of interest include semiconductor devices and sensors, mixed signal and analog signal processing, and
computational intelligence.

Dr. Wilamowski was the vice president of the IEEE Computational Intelligence Society (2000-2004)
and the president of the IEEE Industrial Electronics Society (2004-2005). He served as an associate
editor of IEEE Transactions on Neural Networks, IEEE Transactions on Education, IEEE Transactions on
Industrial Electronics, the Journal of Intelligent and Fuzzy Systems, the Journal of Computing, and the
International Journal of Circuit Systems and IES Newsletter. He is currently serving as the editor in
chief of IEEE Transactions on Industrial Electronics.

Professor Wilamowski is an IEEE fellow and an honorary member of the Hungarian Academy of
Science. In 2008, the president of Poland awarded him the Commander Cross of the Order of Merit of the
Republic of Poland for outstanding service in the proliferation of international scientific
collaborations and for achievements in the areas of microelectronics and computer science.



J. David Irwin received his BEE from Auburn University, Alabama, 
in 1961, and his MS and PhD from the University of Tennessee, 
Knoxville, in 1962 and 1967, respectively. 

In 1967, he joined Bell Telephone Laboratories, Inc., Holmdel, New Jersey, as a member of the technical
staff and was made a supervisor in 1968. He then joined Auburn University in 1969 as an assistant
professor of electrical engineering. He was made an associate professor in 1972, associate professor
and head of department in 1973, and professor and head in 1976. He served as head of the Department of
Electrical and Computer Engineering from 1973 to 2009. In 1993, he was named Earle C. Williams Eminent
Scholar and Head. From 1982 to 1984, he was also head of the Department of Computer Science and
Engineering. He is currently the Earle C. Williams Eminent Scholar in Electrical and Computer
Engineering at Auburn.

Dr. Irwin has served the Institute of Electrical and Electronics Engineers, Inc. (IEEE) Computer
Society as a member of the Education Committee and as education editor of Computer. He has served
as chairman of the Southeastern Association of Electrical Engineering Department Heads and the
National Association of Electrical Engineering Department Heads and is past president of both the
IEEE Industrial Electronics Society and the IEEE Education Society. He is a life member of the IEEE
Industrial Electronics Society AdCom and has served as a member of the Oceanic Engineering Society
AdCom. He served for two years as editor of IEEE Transactions on Industrial Electronics. He has served
on the Executive Committee of the Southeastern Center for Electrical Engineering Education, Inc.,
and was president of the organization in 1983-1984. He has served as an IEEE Adhoc Visitor for ABET
Accreditation teams. He has also served as a member of the IEEE Educational Activities Board, and
was the accreditation coordinator for IEEE in 1989. He has served as a member of numerous IEEE
committees, including the Lamme Medal Award Committee, the Fellow Committee, the Nominations and
Appointments Committee, and the Admission and Advancement Committee. He has served as a member of the
board of directors of IEEE Press. He has also served as a member of the Secretary of the Army’s
Advisory Panel for ROTC Affairs, as a nominations chairman for the National Electrical Engineering
Department Heads Association, and as a member of the IEEE Education Society’s McGraw-Hill/Jacob
Millman Award Committee. He has also served as chair of the IEEE Undergraduate and Graduate
Teaching Award Committee. He is a member of the board of governors and past president of Eta Kappa
Nu, the ECE Honor Society. He has been and continues to be involved in the management of several
international conferences sponsored by the IEEE Industrial Electronics Society, and served as general
cochair for IECON’05.

Dr. Irwin is the author and coauthor of numerous publications, papers, patent applications, and
presentations, including Basic Engineering Circuit Analysis, 9th edition, published by John Wiley &
Sons, which is one of his 16 textbooks. His textbooks, which span a wide spectrum of engineering
subjects, have been published by Macmillan Publishing Company, Prentice Hall Book Company, John
Wiley & Sons Book Company, and IEEE Press. He is also the editor in chief of a large handbook
published by CRC Press, and is the series editor for Industrial Electronics Handbook for CRC Press.

Dr. Irwin is a fellow of the American Association for the Advancement of Science, the American
Society for Engineering Education, and the Institute of Electrical and Electronics Engineers. He
received an IEEE Centennial Medal in 1984, and was awarded the Bliss Medal by the Society of
American Military Engineers in 1985. He received the IEEE Industrial Electronics Society’s Anthony
J. Hornfeck Outstanding Service Award in 1986, and was named IEEE Region III (U.S. Southeastern
Region) Outstanding Engineering Educator in 1989. In 1991, he received a Meritorious Service
Citation from the IEEE Educational Activities Board, the 1991 Eugene Mittelmann Achievement
Award from the IEEE Industrial Electronics Society, and the 1991 Achievement Award from the IEEE
Education Society. In 1992, he was named a Distinguished Auburn Engineer. In 1993, he received the
IEEE Education Society’s McGraw-Hill/Jacob Millman Award, and in 1998 he was the recipient of the
IEEE Undergraduate Teaching Award. In 2000, he received an IEEE Third Millennium Medal and
the IEEE Richard M. Emberson Award. In 2001, he received the American Society for Engineering
Education’s (ASEE) ECE Distinguished Educator Award. Dr. Irwin was made an honorary professor,
Institute for Semiconductors, Chinese Academy of Science, Beijing, China, in 2004. In 2005, he
received the IEEE Education Society’s Meritorious Service Award, and in 2006, he received the IEEE
Educational Activities Board Vice President’s Recognition Award. He received the Diplome of Honor
from the University of Patras, Greece, in 2007, and in 2008 he was awarded the IEEE IES Technical
Committee on Factory Automation’s Lifetime Achievement Award. In 2010, he was awarded the Electrical
and Computer Engineering Department Heads Association’s Robert M. Janowiak Outstanding Leadership and
Service Award. In addition, he is a member of the following honor societies: Sigma Xi, Phi Kappa Phi,
Tau Beta Pi, Eta Kappa Nu, Pi Mu Epsilon, and Omicron Delta Kappa.


Contributors


Sabeur Abid
Ecole Superieure Sciences et Techniques Tunis
University of Tunis
Tunis, Tunisia

Filipe Alvelos
Algoritmi Research Center
and
Department of Production and Systems
University of Minho
Braga, Portugal

Christian Blum
ALBCOM Research Group
Universitat Politécnica de Catalunya
Barcelona, Spain

Oleg Boulanov
Department of Electrical and Computer Engineering
University of Calgary
Calgary, Alberta, Canada

Tak Ming Chan
Algoritmi Research Center
University of Minho
Braga, Portugal

Mo-Yuen Chow
Department of Electrical and Computer Engineering
North Carolina State University
Raleigh, North Carolina


Kun Tao Chung 

Department of Electrical and Computer 
Engineering 

Auburn University 

Auburn, Alabama 


Carlos A. Coello Coello
Departamento de Computación
Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional
Mexico City, Mexico


Nicholas Cotton 

Panama City Division 

Naval Surface Warfare Centre 
Panama City, Florida 


Mehmet Onder Efe 

Department of Electrical and Electronics 
Engineering 

Bahcesehir University 

Istanbul, Turkey 


Age J. Eide 

Department of Computing Science 
Ostfold University College 
Halden, Norway 


Farhat Fnaiech
Ecole Superieure Sciences et Techniques Tunis
University of Tunis
Tunis, Tunisia

Nader Fnaiech
Ecole Superieure Sciences et Techniques Tunis
University of Tunis
Tunis, Tunisia


Hani Hagras 

The Computational Intelligence Centre 
University of Essex 

Essex, United Kingdom 


Barrie W. Jervis 

Department of Electrical Engineering 
Sheffield Hallam University 

Sheffield, United Kingdom 


Józef Korbicz

Institute of Control and Computation 
Engineering 

University of Zielona Gora 

Zielona Gora, Poland 


Sam Kwong 

Department of Computer Science 
City University of Hong Kong 
Kowloon, Hong Kong 


Thomas Lindblad 

Physics Department 

Royal Institute of Technology 
Stockholm, Sweden 


Manuel López-Ibáñez
IRIDIA 

Université Libre de Bruxelles 
Brussels, Belgium 


Kim Fung Man 

Department of Electronic Engineering 
City University of Hong Kong 
Kowloon, Hong Kong 


Milos Manic 

Department of Computer Science 
University of Idaho, Idaho Falls 
Idaho Falls, Idaho 


Michael Margaliot 

School of Electrical Engineering 
Tel Aviv University 

Tel Aviv, Israel 


Marcin Mrugalski 

Institute of Control and Computation 
Engineering 

University of Zielona Gora 

Zielona Gora, Poland 


Andrzej Obuchowicz 

Institute of Control and Computation 
Engineering 

University of Zielona Gora 

Zielona Gora, Poland 


Teresa Orlowska-Kowalska 

Institute of Electrical Machines, Drives 
and Measurements 

Wroclaw University of Technology 

Wroclaw, Poland 


Guy Paillet 
General Vision Inc. 
Petaluma, California 


Witold Pedrycz 

Department of Electrical and Computer 
Engineering 

University of Alberta 

Edmonton, Alberta, Canada 


and 


System Research Institute 
Polish Academy of Sciences 
Warsaw, Poland 


Ioannis Pitas

Department of Informatics 
Aristotle University of Thessaloniki 
Thessaloniki, Greece 


Valeri Rozin 

School of Electrical Engineering 
Tel Aviv University 

Tel Aviv, Israel 


Vlad P. Shmerko 

Electrical and Computer Engineering 
Department 

University of Calgary 

Calgary, Alberta, Canada 


Elsa Silva 

Algoritmi Research Center 
University of Minho 
Braga, Portugal 


Adam Slowik
Department of Electronics and Computer Science
Koszalin University of Technology
Koszalin, Poland


Adrian Stoica 
Jet Propulsion Laboratory 
Pasadena, California 


Krzysztof Szabat 

Institute of Electrical Machines, Drives 
and Measurements 

Wroclaw University of Technology 

Wroclaw, Poland 


Ryszard Tadeusiewicz
Automatic Control
AGH University of Science and Technology
Krakow, Poland


Kit Sang Tang 

Department of Electronic Engineering 
City University of Hong Kong 
Kowloon, Hong Kong 


Anastasios Tefas 

Department of Informatics 
Aristotle University of Thessaloniki 
Thessaloniki, Greece 


J.M. Valério de Carvalho 

Algoritmi Research Center 

and 

Department of Production and Systems 
University of Minho 

Braga, Portugal 


Juyang Weng 

Department of Computer Science and 
Engineering 

Michigan State University 

East Lansing, Michigan 


Paul J. Werbos 

Electrical, Communications and Cyber Systems 
Division 

National Science Foundation 

Arlington, Virginia 


Bogdan M. Wilamowski 

Department of Electrical and Computer 
Engineering 

Auburn University 

Auburn, Alabama 


Tiantian Xie 

Department of Electrical and Computer 
Engineering 

Auburn University 

Auburn, Alabama 


Ronald R. Yager 
Iona College 
New Rochelle, New York 


Svetlana N. Yanushkevich 

Department of Electrical and Computer 
Engineering 

University of Calgary 

Calgary, Alberta, Canada 


Gary Yen 

School of Electrical and Computer Engineering 
Oklahoma State University 

Stillwater, Oklahoma 


Hao Yu 

Department of Electrical and Computer 
Engineering 

Auburn University 

Auburn, Alabama 


Introductions


1 Introduction to Intelligent Systems  Ryszard Tadeusiewicz
Introduction • Historical Perspective • Human Knowledge Inside the Machine—Expert Systems •
Various Approaches to Intelligent Systems • Pattern Recognition and Classifications • Fuzzy Sets
and Fuzzy Logic • Genetic Algorithms and Evolutionary Computing • Evolutionary Computations and
Other Biologically Inspired Methods for Problem Solving • Intelligent Agents • Other AI Systems
of the Future: Hybrid Solutions • References

2 From Backpropagation to Neurocontrol  Paul J. Werbos
Listing of Key Types of Tools Available • Historical Background and Larger Context • References

3 Neural Network-Based Control  Mehmet Onder Efe
Background of Neurocontrol • Learning Algorithms • Architectural Varieties • Neural Networks for
Identification and Control • Neurocontrol Architectures • Application Examples • Concluding
Remarks • Acknowledgments • References

4 Fuzzy Logic-Based Control Section  Mo-Yuen Chow
Introduction to Intelligent Control • Brief Description of Fuzzy Logic • Qualitative (Linguistic)
to Quantitative Description • Fuzzy Operations • Fuzzy Rules, Inference • Fuzzy Control • Fuzzy
Control Design • Conclusion and Future Direction • References


Introduction to
Intelligent Systems


Ryszard Tadeusiewicz
AGH University of Science and Technology


1.1 Introduction
1.2 Historical Perspective
1.3 Human Knowledge Inside the Machine—Expert Systems
1.4 Various Approaches to Intelligent Systems
1.5 Pattern Recognition and Classifications
1.6 Fuzzy Sets and Fuzzy Logic
1.7 Genetic Algorithms and Evolutionary Computing
1.8 Evolutionary Computations and Other Biologically Inspired Methods for Problem Solving
1.9 Intelligent Agents
1.10 Other AI Systems of the Future: Hybrid Solutions
References


1.1 Introduction 


Numerous intelligent systems, described and discussed in the subsequent chapters, are based on
different approaches to machine intelligence problems. The authors of these chapters show the necessity
of using various methods for building intelligent systems. Almost every particular problem needs an
individual solution; thus, many different intelligent systems are reported in the literature. This
chapter serves as an introduction to the particular systems and different approaches presented in the
book. Its role is to provide the reader with a bird’s-eye view of the area of intelligent systems.
Before we explain what intelligent systems are and why they are worth studying and using, it is
necessary to comment on one problem connected with the terminology.

The problems of equipping artificial systems with intelligent abilities are, in fact, unique. We always
want to achieve a general goal: better operation of the intelligent system than can be accomplished by
a system without intelligent components. There are many ways to accomplish this goal and, therefore,
we have many kinds of artificial intelligence (AI) systems. In general, it should be stressed that
there are two distinct groups of researchers working in these areas: the AI community and the
computational intelligence community. The goal of both groups is the same: artificial systems powered
by intelligence. However, the two communities employ different methods to achieve this goal.

AI [LS04] researchers focus on the imitation of human thinking methods, as discovered by psychology,
sometimes philosophy, and the so-called cognitive sciences. The main achievements of AI are
traditionally rule-based systems in which computers follow known methods of human thinking and try to
achieve results similar to those of humans. Expert systems, mentioned below and described in detail in
a separate chapter, are good examples of this AI approach.

Computational intelligence [PM98] researchers focus on the modeling of natural systems that can be
considered intelligent. The human brain is definitely the source of intelligence; therefore, this
area of research focuses first on neural networks, which are very simplified, yet practically
efficient, models of small parts of the biological brain. There are also other natural processes that
can be used (when appropriately modeled) as a source of ideas for successful AI systems. We mention,
e.g., swarm intelligence models, evolutionary computations, and fuzzy systems.

The differentiation between AI and computational intelligence (also known as soft computing
[CM05]) is important for researchers and should be observed in scientific papers so that they are
properly classified. However, from the point of view of applications in intelligent systems, it can be
disregarded. Therefore, in the following sections, we will simply use one name (artificial
intelligence) comprising both artificial intelligence and computational intelligence methods. For more
precise differentiations and for tracing bridges between the two approaches, the reader is referred
to the book [RT08].

The term “artificial intelligence” (AI) is used in a way similar to terms such as mechanics or
electronics, but the area of research and applications that belongs to AI is not as precisely defined
as the other areas of computer science. The most popular definitions of AI are always related to the
human mind and its emergent property: natural intelligence. At times it was fashionable to discuss the
general definition of AI by asking: Is AI at all possible or not? Almost everybody knows Turing’s
answer to this question [T48], known as the “Turing test,” in which a human judge must recognize
whether an unknown partner in a discussion is an intelligent (human) person or not. Many also know
Searle’s response to the question, his “Chinese room” model [S80]. For more information about these
controversies, the reader is referred to the literature listed at the end of this chapter (a small
bibliography of AI), as well as to the more comprehensive discussion of this problem on thousands of
web pages on the Internet. From our point of view, it is sufficient to conclude that the discussion
between supporters of “strong AI” and their opponents is still open, with all holding their opinions.

For the readers of this volume, the results of these discussions are not that important, since
regardless of the outcome of the philosophical debates, “intelligent” systems were built in the past,
are used today, and will be constructed in the future. This is because intelligent systems are very
useful for all, irrespective of their belief in “strong AI.” Therefore, in this chapter, we do not try
to answer the fundamental question about the existence of the mind in the machine. We just present some
useful methods and try to explain how and when they can be used. This detailed knowledge will be
presented in the next few sections and chapters. To begin, let us consider neural networks.


1.2 Historical Perspective 


This chapter is not meant to be a history of AI, because users are interested in exploiting mature
systems, algorithms, or technologies, regardless of the long and difficult paths of systematic
development of particular methods and the serendipities that were important milestones in AI’s
development. Nevertheless, it is good to know that the oldest systems solving many problems by means
of AI methods were neural networks. This very clever and user-friendly technology is based on the
modeling of small parts of real neural systems (e.g., small pieces of the brain cortex) that are able
to solve practical problems by means of learning. Neural network theory, architecture, learning,
and methods of application will be discussed in detail in other chapters; therefore, here we only
provide a general outline.

As mentioned above, neural networks are the oldest AI technology, and they are still the leading
technology if one counts the number of practical applications. When the first computers were still
large and clumsy, the neurocomputing theorists Warren Sturgis McCulloch and Walter Pitts published
“A logical calculus of the ideas immanent in nervous activity” [MP43], thus laying the foundations for
the field of artificial neural networks. This paper is considered the one that started the entire AI
area. Many books written earlier and sometimes quoted as heralds of AI were only theoretical
speculations. In contrast, the paper just quoted was the first constructive proposition on how to build
AI on the basis of mimicking brain structures. It was a fascinating and breakthrough idea in the area
of computer science.
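
To make the idea of a constructive neuron model concrete, the following minimal Python sketch implements a binary threshold unit in the spirit of the McCulloch and Pitts proposal. It is an illustration under stated assumptions rather than the formulation given in [MP43]: the function names, the integer weights, and the thresholds are introduced here only for the example, and inhibition is modeled with a negative weight, as in later textbook simplifications.

```python
from typing import Sequence


def mcp_neuron(inputs: Sequence[int], weights: Sequence[int], threshold: int) -> int:
    """Return 1 (fire) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


# Hypothetical weight/threshold choices that realize basic logic functions.
def logic_and(a: int, b: int) -> int:
    return mcp_neuron((a, b), (1, 1), threshold=2)


def logic_or(a: int, b: int) -> int:
    return mcp_neuron((a, b), (1, 1), threshold=1)


def logic_not(a: int) -> int:
    # A negative (inhibitory) weight with threshold 0 inverts the input.
    return mcp_neuron((a,), (-1,), threshold=0)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", logic_and(a, b), "OR:", logic_or(a, b))
    print("NOT 0:", logic_not(0), "NOT 1:", logic_not(1))
```

Units of this kind have fixed weights and therefore compute logic functions without any learning; adjusting the weights from examples is what the later learning methods added.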

Over many years of development, neural networks became the first working AI systems (the Perceptron
by Frank Rosenblatt, 1957). The technology was then underestimated and lost “steam” because of the
(in)famous book by Marvin Minsky [MP72], but it returned triumphantly as an efficient tool for
practical problem solving with David Rumelhart’s discovery of the backpropagation learning method
[RM86]. Since the mid-1980s, the power and importance of neural networks have steadily increased, and
they now hold a definitely leading position in all AI applications. However, this position is somewhat
weakened by the increase in popularity and importance of other methods belonging to so-called soft
computing. But if one has a problem and needs to solve it fast and efficiently, one can still choose
neural networks as a tool that is easy to use, with lots of good software available.

The above comments are the reason we pointed out neural networks technology in the title of this
chapter with the descriptive qualification “first.” From the historical point of view, the neural
network was the first AI tool. From the practical viewpoint, it should be used as the first tool when
practical problems need to be solved. There is a good chance that the neural network tool you use will
turn out to be good enough and you will not need anything more. Let me give you some advice, taken
from long years of experience in solving hundreds of problems with neural network applications. There
are several types of neural networks, elaborated on and discovered by hundreds of researchers. But the
simplest and yet most successful tool for most problems is the network called the MLP (multilayer
perceptron [H98]). If one knows exactly the categories and their exemplars, one may use this network
with a learning rule such as the conjugate gradient method. If, on the other hand, one does not know
what to expect in the data because no prior knowledge about the data exists, one may use another
popular type of neural network, namely, the SOM (self-organizing map), also known as the Kohonen
network [K95], which can learn without a teacher. If one has an optimization problem and needs to find
the best solution in a complex situation, one can use the recurrent network known as the Hopfield
network [H82]. Experts and practitioners can of course also use other types of neural networks,
described in hundreds of books and papers, but that will be a kind of intellectual adventure, like an
off-road expedition. Our advice is like signposts pointing toward highways; highways are boring but
lead straight to the destination. If one must solve a practical problem, often there is no time for
adventures.
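
The first recommendation, an MLP trained on labeled exemplars, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses ordinary batch gradient descent with backpropagation rather than the conjugate gradient method mentioned in the text, and the names `train_mlp` and `predict`, the network size, the learning rate, and the XOR data are all hypothetical choices made for the example. Whether training reaches a small error depends on the random initialization.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_mlp(samples, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP with batch gradient descent (backpropagation).

    Returns the hidden weights, the output weights, and the loss per epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # One weight row per hidden unit; the last entry of each row is the bias.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    losses = []
    for _ in range(epochs):
        grad_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        grad_o = [0.0] * (n_hidden + 1)
        loss = 0.0
        for x, t in samples:
            xb = list(x) + [1.0]                      # input plus bias term
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]                            # hidden outputs plus bias
            y = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            loss += 0.5 * (y - t) ** 2
            delta_o = (y - t) * y * (1.0 - y)         # output-layer error signal
            for j in range(n_hidden + 1):
                grad_o[j] += delta_o * hb[j]
            for j in range(n_hidden):
                delta_h = delta_o * w_o[j] * h[j] * (1.0 - h[j])
                for i in range(n_in + 1):
                    grad_h[j][i] += delta_h * xb[i]
        # Apply the accumulated gradients once per epoch (batch update).
        for j in range(n_hidden + 1):
            w_o[j] -= lr * grad_o[j]
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_h[j][i] -= lr * grad_h[j][i]
        losses.append(loss)
    return w_h, w_o, losses


def predict(w_h, w_o, x) -> float:
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h] + [1.0]
    return sigmoid(sum(w * v for w, v in zip(w_o, hb)))


# Illustrative labeled exemplars: the XOR problem.
XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

if __name__ == "__main__":
    w_h, w_o, losses = train_mlp(XOR)
    print("first loss:", losses[0], "last loss:", losses[-1])
    for x, t in XOR:
        print(x, "target", t, "output", round(predict(w_h, w_o, x), 3))
```

Plain gradient descent is used here only because it is short to write down; in practice, a library implementation with a stronger optimizer would normally be preferred.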


1.3 Human Knowledge Inside the Machine—Expert Systems 


Neural networks, discussed in the previous section and described in detail in the following chapter,
are very useful and effective tools for building intelligent systems, but they have one troublesome
limitation. There is a huge gap between the knowledge encoded in the neural network structure during
the learning process and knowledge presented in a form that humans can easily understand (mainly
symbolic forms and natural language statements). It is very difficult to use the knowledge that is
captured by the neural network during its learning process, although sometimes this knowledge is the
most valuable part of the whole system (e.g., in forecasting systems, where neural networks are
sometimes, after learning, a very successful tool, but nobody knows how and why).

The above-mentioned gap is also present when going in the opposite direction, i.e., when we need to
add human knowledge to the AI system. Very often we need to embed in an
automatic intelligent system some part of the knowledge that can be obtained from a
human expert. We need to insert this knowledge into an automatic intelligent system because it is
often easier and cheaper to use a computer program instead of constantly asking humans for expert
opinion or advice.

Such a design, with a computer-based shell and human knowledge inside it, is known as an expert system
[GR89]. Such a system can answer questions not only by searching its internal knowledge representation
but also by using methods of automatic reasoning to derive the conclusions needed by
the user. The expert system can be very helpful for many purposes, combining the knowledge elements
extracted from both sources of information: explicit elements of human expert wisdom collected in the
form of the knowledge base in computer memory, and elements of knowledge hidden in the system
that emerge by means of automatic reasoning methods when the system is questioned.
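
The idea of a knowledge base combined with automatic reasoning can be illustrated by a minimal forward-chaining sketch. The rules and fact names below are invented purely for illustration and do not come from any particular expert system shell; real systems use much richer knowledge representations.

```python
# Minimal forward-chaining sketch. The rules and facts are hypothetical
# examples, not taken from any real expert system.
RULES = [
    ({"motor_hot", "load_normal"}, "cooling_fault"),
    ({"cooling_fault"}, "inspect_fan"),
]


def infer(facts, rules):
    """Derive new facts from the rules until nothing changes.

    Returns the set of known facts and a trace explaining each derivation.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                trace.append(f"{' and '.join(sorted(premises))} -> {conclusion}")
                changed = True
    return known, trace


facts, trace = infer({"motor_hot", "load_normal"}, RULES)
for step in trace:
    print(step)
```

The trace plays the role of the explanation facility: each derived conclusion is recorded together with the premises that produced it.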

The main difference between expert systems and neural networks lies in the source and
form of the knowledge used in these two AI tools for practical problem solving. In neural networks,
the knowledge is hidden and has no readable form but can be collected automatically on the
basis of examples forming the learning data set. Results given by neural networks can be true and
very useful but never comprehensible to users, and therefore must be treated with caution. On the 
other hand, in the expert system, everything is transparent and intelligible (most of such systems can 
provide explanations of how and why the particular answer was derived) but the knowledge used by 
the system must be collected by humans (experts themselves or knowledge engineers who interview 
experts), properly formed (knowledge representation is a serious problem), and input into the sys- 
tem’s knowledge base. Moreover, the methods of automatic reasoning and inference rules must be 
constructed by the system designer and must be explicit to be built into the system’s structure. It is 
always difficult to do so and sometimes it is the source of limitations during the system’s development 
and exploitation. 


1.4 Various Approaches to Intelligent Systems 


There are various approaches to intelligent systems, but the fundamental difference lies in the
following distinction: the methods under consideration can be described as either symbolic or
holistic.

In general, the domain of AI (very wide and presented in this chapter only in small part) can
be divided or classified using many criteria. One of the most important divisions of the whole area
can be based on the difference between the symbolic and holistic (purely numerical) approaches. This
distinction applies to all AI methods but can be shown and discussed on the basis of only two technologies
presented here: neural networks and expert systems. Neural networks are a technology definitely
dedicated to quantitative (numerical) calculations. Signals at the input, at the output, and, most importantly,
in every element inside the neural network are in the form of numbers even if their interpretation is of
a qualitative type. It means that we must convert qualitative information into a quantitative
representation in the network. This problem is beyond the scope of this chapter; therefore, we only mention a
popular way of such a conversion, called “one of N.” This type of data representation
spreads one qualitative input over N neurons in the input layer, where N is the number of
distinguishable values that can be observed in the considered data element. For example, if
the qualitative value under consideration is “country of origin” and if there are four possible countries
(let us say the United States, Poland, Russia, and Germany), we must use four neurons to represent this data,
with all signals equal to zero except the one input corresponding to the selected value in
the input data, where the signal is equal to 1. In this representation, 0,1,0,0 means Poland, etc. The identical
method is used for the representation of output signals in neural networks performing a classification
task. The output from such a network is in theory singular, because we expect only one answer:
the label of the class to which the classified object belongs, given the current input of the network.
But because the label of the class is not a quantitative value, we must use in the output layer of the
network as many neurons as there are classes, and the classification process will be assessed as
successful when the output neuron attributed to the proper class label produces a signal much stronger
than the other output neurons.
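
The “one of N” representation described above can be sketched in a few lines. The function names are illustrative assumptions; the country list follows the example in the text, and the output signals passed to the decoder are made-up values.

```python
# "One of N" encoding of a qualitative value, and decoding of network outputs.
# Function names are illustrative; the country list follows the text's example.
COUNTRIES = ["United States", "Poland", "Russia", "Germany"]


def one_of_n(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def decode(outputs, categories):
    """Pick the class whose output neuron produces the strongest signal."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return categories[winner]


print(one_of_n("Poland", COUNTRIES))             # [0, 1, 0, 0]
print(decode([0.1, 0.8, 0.2, 0.05], COUNTRIES))  # Poland
```

The decoder simply takes the strongest output; a practical classifier might additionally require the winning signal to exceed the others by some margin before accepting the answer.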

Returning to the general categorization of AI methods, qualitative versus quantitative, we point out
expert systems as a typical tool for the processing of qualitative (symbolic) data. The source of power in
every expert system is its knowledge base, which is constructed from elements of knowledge obtained
from human experts. Such elements of knowledge differ in substance because
expert systems can be designed for solving different problems. Also, the internal representation of
human knowledge in a particular computer system can differ, but it is always symbolic
(sometimes even linguistic [natural language sentences]).

The methods of symbolic manipulation have always been very attractive for AI researchers because the
introspective view of the human thinking process is usually registered in a symbolic form (so-called internal
speech). Thus, in our awareness, almost every active cognitive process is based on symbol manipulation.
Also, from the psychological point of view, the nature of the activity of the human mind is defined as
analytical-synthetical. What is especially emphasized is the connection between thinking and speaking
(language), as neither of these abilities is believed to be able to develop in separation
from the other.

Therefore, the “founding fathers” of AI made heavy use of symbolic manipulations in their early works
as tools for AI problem solving. A well-known example of this stream of works was the system
named GPS (General Problem Solver) created in 1957 by Herbert Simon and Allen Newell [NS59].
It was a famous example, but we stress that a lot of AI systems based on symbolic manipulations 
and applying diverse approaches have been described in the literature. They were dedicated to 
automatic proving of mathematical theorems, playing a variety of games, solving well-formalized 
problems (e.g., Towers of Hanoi problem), planning of robot activities in artificial environments 
(“blocks world”), and many others. Also, early computer languages designed for AI purposes (e.g., 
LISP) were symbolic. 

The differentiation between symbolic manipulations (as in expert systems) and holistic evaluation
based on numerical data (as in neural networks) is observable in AI technology. It must be taken into
account by everyone who strives to enhance electronic systems, whether designed or used, by
augmenting them with AI components.

We note one more surprising aspect of the contradiction discussed above. Our introspection
suggests a kind of internal symbolic process that accompanies every mental process inside the
human brain. At the same time, neural networks that are models of the human brain are not able to use
symbolic manipulation at all!

AI methods and tools are used for many purposes, but one of the most important areas where AI algorithms
are used with good results is pattern recognition. The need for data
classification is very common because if we can classify the data, we can also better understand the information
hidden in the data streams and thus can pursue knowledge extraction from the information.

In fact, when applying automatic classification methods in AI, we must take into account two types of
problems and two groups of methods used for solving them.


1.5 Pattern Recognition and Classifications 


The first one is the classical pattern recognition problem, with many typical methods used for solving it.
At the start of all such methods, we have a collection of data and, as a presumption, a set of precisely
defined classes. We need a method (a formal algorithm or a simulated device like a neural network) for
automatically deciding to which class a particular data point belongs. The problem under consideration
is important from a practical point of view because such a classification-based model of data
mining is one of the most effective tools for discovering the order and internal structure hidden in the
data. This problem is also interesting from the scientific point of view and often difficult to solve because
in most pattern recognition tasks, we do not have any prior knowledge about classification rules. The
relationship between data elements and the classes to which these data should be classified is given only
in the form of a collection of properly classified examples. Therefore, all pattern recognition problems are
examples of inductive reasoning tasks and need some machine learning approach that is both interesting
and difficult [TK09].

FIGURE 1.1 Pattern recognition problem with supervised learning used.


Machine learning methods can be divided into two general groups. The first group is based on supervised
learning while the second is related to unsupervised learning, also called self-learning or learning
without a teacher.

An example of supervised learning is presented in Figure 1.1. The learning system (represented by
a computer with a learning algorithm inside) receives information about some object (e.g., a man’s face).
The information about the object is introduced through the system input while the teacher guiding the
supervised learning process supplies the proper name of the class to which this object should be
assigned. The proper name of the class is “Man” and this information is memorized in the system. Next,
another object is shown to the system, and for every object, the teacher gives additional information about
the class to which this object belongs. After many learning steps, the system is ready for an exam, and then a new
object (never seen before) is presented. Using the knowledge accumulated during the learning process, the
system can recognize unknown objects (e.g., a man).

In real situations, a special database (called the learning set) is used instead of a human teacher. In such a
database, we have examples of input data as well as the proper output information (results of correct
recognition). Nevertheless, the general scheme of supervised learning, shown in Figure 1.1, also applies in
this situation.

Methods used in AI for pattern recognition vary from simple ones, based on naive geometrical intu- 
itions used to split data description space (or data features space) into parts belonging to different classes 
(e.g., k-nearest neighbor algorithm), through methods in which the computer must approximate bor- 
ders between regions of data description space belonging to particular classes (e.g., discriminant func- 
tion methods or SVM algorithms), up to syntactic methods based on structure or linguistics, used for 
description of classified data [DH01]. 
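
As an illustration of the simplest geometric approach mentioned above, the following is a minimal sketch of a k-nearest neighbor classifier working on a small learning set. The two-dimensional feature vectors and the class labels "A" and "B" are made-up values chosen only to show the mechanism.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Classify `query` by majority vote among its k nearest examples.

    `examples` is a list of (feature_vector, label) pairs, i.e., a learning set
    in which every input is accompanied by the correct class.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical learning set: two well-separated groups of points.
LEARNING_SET = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_classify((1.1, 0.9), LEARNING_SET))  # A
print(knn_classify((4.0, 4.0), LEARNING_SET))  # B
```

No explicit border between classes is computed here; the learning set itself acts as the stored knowledge, which is why this method is usually counted among the naive geometric ones.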

A second group of problems considered in AI and related to the data classification tasks is cluster 
analysis [AB84]. The characteristics of these problems are symmetrical (or dual) to the above-mentioned 
pattern recognition problems. Whereas in pattern recognition we have predefined classes and need a 
method for establishing membership for every particular data point into one of such classes, in cluster 
analysis, we only have the data and we must discover how many groups are in the data. There are many
interesting approaches to solving clustering problems, and this problem can be thought of as the first
step in building automatic systems capable of knowledge discovery, not only learning [CP07].

FIGURE 1.2 Classification problem with unsupervised learning used.

Let us discuss the unsupervised learning scheme used for automatic solving of classification problems.
During self-learning, the learning algorithm also receives information about features of the objects
under consideration, but in this case, this input information is not enriched by accompanying information
given by the teacher, because the teacher is absent. Nevertheless, a self-learning algorithm can
classify the objects using only similarity criteria and can then recognize new objects as belonging
to particular self-defined classes.
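
A very simple way to see self-learning at work is the so-called leader clustering scheme sketched below. It is only one of many possible clustering methods and is used here as an assumption for illustration: a point joins the nearest existing group if it is similar enough (within a distance threshold), and otherwise founds a new group. The data points and the threshold are invented values.

```python
import math


def leader_clustering(points, threshold):
    """Group points using only a similarity (distance) criterion.

    Returns the list of group leaders and, for each point, its group index.
    No class labels are supplied: the groups are discovered from the data.
    """
    leaders = []
    labels = []
    for p in points:
        distances = [math.dist(p, leader) for leader in leaders]
        if distances and min(distances) <= threshold:
            labels.append(distances.index(min(distances)))
        else:
            leaders.append(p)               # p founds a new self-defined class
            labels.append(len(leaders) - 1)
    return leaders, labels


def assign(point, leaders):
    """Recognize a new object as belonging to the nearest discovered group."""
    return min(range(len(leaders)), key=lambda i: math.dist(point, leaders[i]))


# Hypothetical unlabeled data.
DATA = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (0.9, 1.1), (5.2, 4.9), (9.0, 1.0)]
leaders, labels = leader_clustering(DATA, threshold=1.0)
print(len(leaders), labels)         # 3 [0, 0, 1, 0, 1, 2]
print(assign((5.1, 5.0), leaders))  # 1
```

The number of groups is not given in advance; it follows from the data and the chosen threshold, which is the design parameter that replaces the absent teacher.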


1.6 Fuzzy Sets and Fuzzy Logic 


One of the differences between the human mind and the computer relates to the nature of information/ 
knowledge representation. Computers must have information in precise form, such as numbers, sym- 
bols, words, or even graphs; however, in each case, it must be an exact number, or a precisely selected 
symbol, or a properly expressed word or graph plotted in a precisely specified form, color, and dimen- 
sion. Computers cannot accept a concept such as “integer number around 3,” or “symbol that looks like 
a letter,” etc. In contrast, human minds perform very effective thinking processes that take into account 
imprecise qualitative data (e.g., linguistic terms) but can come up with a good solution, which can be 
expressed sharply and precisely. 

There are many examples showing the difference between mental categories (e.g., “young woman”) 
and precisely computed values (e.g., age of particular people). Definitely, the relation between math- 
ematical evaluation of age and “youngness” as a category cannot be expressed in a precise form. We 
cannot precisely answer the question, at which second of a girl’s life she transforms to a woman, or at 
which exact hour her old age begins. 

In every situation, when we need to implement in an intelligent system a part of human common 
sense, there is a contradiction between human fuzzy/soft thinking and the electronic system’s sharp 
definition of data elements and use of precise algorithms. As is well known, the solution is to use fuzzy 
set and fuzzy logic methods [Z65]. A fuzzy set (e.g., the one we used above, “young woman”) consists of
the elements that, for sure (according to human experts), belong to this set (e.g., an 18-year-old graduate of
high school), excludes the elements that absolutely do not belong to this fuzzy set (e.g., an 80-year-old grandma),
and also takes into account the elements that belong to this set only partially. All elements that have degrees
of membership different from zero belong to this particular fuzzy set. Some of them have membership
function values of 1; they belong to the set unconditionally. Elements with membership function
values of 0 are outside of the set. For all other elements, the value of the membership function is
a real number between 0 and 1. The shape of the membership function is defined by human experts (or
sometimes derived from available data), but for practical computations, the preferred shapes are either triangular
or trapezoidal.
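
A trapezoidal membership function can be sketched as follows. The breakpoints chosen for the fuzzy set “young” (12, 16, 25, and 40 years) are assumptions made only for this illustration; in practice they would be supplied by human experts or derived from data.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function with a < b <= c < d.

    Membership is 0 outside (a, d), rises linearly from a to b,
    equals 1 between b and c, and falls linearly from c to d.
    """
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def young(age):
    """Degree of membership in the fuzzy set 'young' (breakpoints assumed)."""
    return trapezoid(age, 12, 16, 25, 40)


print(young(18))  # 1.0, belongs unconditionally
print(young(80))  # 0.0, outside of the set
print(young(30))  # partial membership, between 0 and 1
```

A triangular membership function is the special case in which the two middle breakpoints coincide (b equal to c).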

Fuzzy logic rules can also be expressed as if ... then ... else ... statements, but these statements are
built from fuzzy formulas. It is worth mentioning that fuzzy logic came into being as an
extension of Łukasiewicz’s multivalued logic [L20]. Details of this approach are described in other
chapters.

It is worth mentioning here a gap between rather simple and easy-to-understand key ideas used 
in fuzzy data representation as well as simple fuzzy logic reasoning methods and rather complex 
practical problems solved in AI by means of fuzzy systems. It can be compared to walking in high 
mountains—first we go through a nice flowering meadow but after a while the walk transforms into 
extreme climbing. 

Not all AI researchers like fuzzy methods. A well-known AI expert commented that this approach 
can be seen as “fuzzy theory about fuzzy sets.” But in fact, the advantages of using fuzzy methods are 
evident. Not only the knowledge-based systems (i.e., expert systems) broadly use fuzzy logic and fuzzy 
representation of linguistic terms, but the fuzzy approach is very popular in economic data assessment, 
in medical diagnosis, and in automatic control systems. Moreover, their popularity increases because in 
many situations they are irreplaceable. 


1.7 Genetic Algorithms and Evolutionary Computing 


Figure 1.3 shows an example of how the properties of a face image can be categorized. The face can be wide or
narrow, can have a large or small mouth, and the eyes can be close together or far apart. Once these categories are selected,
each image of a face can be considered as a point in the three-dimensional space, as shown in Figure 1.4.
Of course, often we can distinguish more than just three properties in the object, and this would be a point
in multidimensional space. Fuzzy systems may handle classification well when the dimensionality
is limited to three. For problems with larger dimensions, neural networks have a significant advantage.

FIGURE 1.3 Example features that can be used for categorization and recognition of faces.

FIGURE 1.4 Relation between image of face and point in three-dimensional space.

While describing neural networks, which are a popular AI technology, we stressed their biological
origin as a crude model of a part of a brain. Thanks to this fact, artificial neural networks exhibit brain-like
behavior: they can learn and self-organize, generalize, be used as predictors/classifiers, arrange
information on the basis of auto- and hetero-associative criteria, perform holistic and intuitive analysis
of complex situations, are robust, etc. The neural network example shows the effectiveness
of translating biological knowledge into technological applications. Neural networks are
obviously not the only example of such biology-to-technology transmission of ideas. Another very
well-known example is evolutionary computation [M96].


1.8 Evolutionary Computations and Other Biologically 
Inspired Methods for Problem Solving 


The biological theory of evolution in many details (especially connected with the origin of humans) is
still an area of heated discussion, but no one questions the existence of evolution as a method of natural
species optimization. In technology, we also seek ever-better optimization methods. Existing optimization
algorithms can be divided (loosely speaking) into two subgroups. The first subgroup is formed
by methods based on goal-oriented search (like the steepest descent/ascent principle); an example is the
gradient descent algorithm. The second subgroup is based on random search methods; an example is
the Monte Carlo method.

Both approaches to optimization suffer serious disadvantages. Methods based on goal-oriented
search are fast and efficient in simple cases, but the solution may be wrong because of local minima
(or maxima) of the criterion function. It is because the search process in all such methods is driven by
local features of the criterion function, which need not be optimal in the global sense. There is no method
that is based on local properties of the optimized function and at the same time can
effectively find the global optimum. On the other hand, the methods that use random searches
can find proper solutions (optimal globally), but require long computational times. It is because the
probability of a global optimum hit is very low and is increased only by means of performing a large
number of attempts.

AI methods based on evolutionary computations combine random searches (because of using
crossover and mutation) with goal-oriented searches (maximization of the fitness function, which is a 
functional to be optimized). Moreover, the search is performed simultaneously on many parallel paths 
because of several individuals (represented by chromosomes) belonging to every simulated population. 
The main idea of evolutionary computing is based on defining all parameters and finding their optimal
values. We generate (randomly) initial values for chromosomes (individuals belonging to the initial
population) and then artificial evolution starts. Details of evolutionary computing are given in other
chapters. It is worth remembering that evolutionary computing is a more general term than, e.g., genetic
algorithms. Every user of genetic algorithms is doing evolutionary computing [K92].
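
The combination of random search (crossover and mutation) with goal-oriented search (selection by fitness) can be sketched with a very small genetic algorithm. Everything below is an illustrative assumption rather than a recommended design: the toy fitness function simply counts ones in a bit string, and the population size, mutation rate, and number of generations are arbitrary.

```python
import random


def fitness(bits):
    """Toy fitness function (an assumption): the number of ones in the chromosome."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=30, mutation_rate=0.02, seed=0):
    """Run a minimal genetic algorithm and return (initial best fitness, best chromosome)."""
    rng = random.Random(seed)
    # Random initial population of chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # goal-oriented part: keep the fitter half
        children = [ranked[0][:]]           # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # random part: crossover point
            child = mother[:cut] + father[cut:]
            # Random part: each gene flips with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best))
```

Because the best individual is always copied into the next generation, the best fitness in the population can never decrease; the many individuals evolving side by side correspond to the parallel search paths mentioned above.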

In the title of this section, we mentioned that there exist other biologically inspired methods for 
problem solving. We note below just two that are very popular. 

The first one is the ant colony optimization method, which is used for solving many optimization
problems and is based on the behavior of ants. Just as the neural network is a very simplified model of a
part of the human brain and genetic algorithms work on the basis of evolution, ant colony computations
use a simplified model of the social dependences between ants in an ant colony. Every particular ant is a
primitive organism and its behavior is also primitive and predictable. But the total ant population is
able to perform very complicated tasks like the building of the complex three-dimensional structure of
the anthill or finding the most efficient way for transportation of food from the source to the colony. The
most efficient way is only sometimes equivalent to the shortest path, because it takes into account the structure
of the ground surface to minimize the total effort necessary for food collection. Intelligence of the ant
colony is its emergent feature. The source of the very clever behavior sometimes observed for the whole ant
population is located in rather simple rules controlling the behavior of each particular ant and also in the
simple rules governing relations and “communication” between ants. Both elements (i.e., mechanisms
of single ant activity control as well as communication schemes functioning between ants) are easily
modeled in the computer. Complex and purposeful behavior of the entire ant population can then be
converted into an intelligent solution of a particular problem by the computer [CD91].

The second bio-inspired computational technique used in AI (not discussed previously) is the artificial
immune systems methodology. The natural immune system is the strongest anti-intruder system
that defends living organisms against bacteria, viruses, and other alien elements, which try to pene- 
trate the organism. Natural immune systems can learn and must have memory, which is necessary for 
performing the above-mentioned activities. Artificial immune systems are models of this biological 
system that are able to perform similar activities on computer data, programs, and communication 
processes [CT02]. 


1.9 Intelligent Agents 


Over many years of development, AI algorithms dedicated to solving particular problems placed
big demands on computing power and memory. Therefore, programs carrying the adjective
“intelligent” were hosted on big computers and could not be moved from one computer to another.
An example is Deep Blue, a chess-playing computer developed by IBM, which, on May 11, 1997, won
a six-game match against the world chess champion Garry Kasparov.

In contemporary applications, AI, even the most successful, located in one particular place is not 
enough for practical problem solving. The future is distributed AI, ubiquitous intelligence, which can be 
realized by means of intelligent agents. 

Agent technology is now very popular in many computer applications, because it is much easier to 
achieve good performance collaboratively, with limited costs by using many small but smart programs 
(agents) that perform some information gathering or processing task in a distributed computer environ- 
ment working in the background. Typically, a particular agent is given a very small and well-defined 
task. Intelligent cooperation between agents can lead to high performance and high quality of the resulting
services for the end users. The most important advantage of such an AI implementation is connected
with the fact that the intelligence is distributed across the whole system and is located in those places
(e.g., Web sites or network nodes) where it is needed [A02].

AI methods based on agent technology are somewhat similar to the ant colony methods described
above. But an intelligent agent can be designed on the basis of neural network technology, can use elements
taken from expert systems, and can engage pattern recognition methods as well as clustering algorithms.
Almost every previously mentioned element of AI can be used in intelligent agent technology as a
realization framework.

The best-known applications of distributed AI implemented as a collection of cooperating but independent
agents are in the area of knowledge gathering for Internet search engines. The second area
of intelligent agent applications is related to spam detection and computer virus elimination tasks.
Intelligent agent technology is on the rise and possibly will be the dominating form of AI in the future.


1.10 Other AI Systems of the Future: Hybrid Solutions 


In the previous sections, we tried to describe some “islands” from the “AI archipelago.” Such islands, 
like neural networks, fuzzy sets, or genetic algorithms are different in many aspects: their theoretical 
background, technology used, data representation, methods of problem solving, and so on. However, 
many AI methods are complementary, not competitive. Therefore many modern solutions are based on 
the combination of these approaches and use hybrid structures, combining the best elements taken from 
more than one group of methods for establishing the best solution. In fact, AI elements can be combined 
in any arrangement because they are flexible. The very popular hybrid combinations are listed below: 


• Neuro-fuzzy systems, which combine the intuitive methodology of fuzzy systems with
the learning power of neural networks

• Expert systems powered by fuzzy logic methods for conclusion derivation

• Genetic algorithms used for the selection of the best neural network structure


Hybridization can be extended to other combinations of AI elements that, when put together, work more
effectively than when used separately. Hybrid constructions are known that combine neural networks
with other methods used for data classification and pattern recognition. Sometimes, expert systems
are combined not only with fuzzy logic but also with neural networks, which can collect knowledge
during their learning process and then put it (after proper transformation) as an additional element into the
knowledge base of the expert system. Artificial immune systems can cooperate with cluster analysis
methods for proper classification of complex data [CA08].

Nobody can foretell how AI will develop in the future. Perhaps AI and computational intelligence will
go toward automatic understanding technologies, developed by the author and described in [TO08]? This
chapter was meant to provide a general overview of AI and electronic engineering; enriched with
this information, the reader will hopefully be better equipped to find proper tools for specific applications.

This chapter provides an overview of the most powerful practical tools developed so far, and under
development, in the areas which the Engineering Directorate of National Science Foundation (NSF) 
has called “cognitive optimization and prediction” [NSF07]. For engineering purposes, “cognitive 
optimization” refers to optimal decision and control under conditions of great complexity with 
use of parallel distributed computing; however, the chapter will also discuss how these tools com- 
pare with older tools for neurocontrol, which have also been refined and used in many applications 
[MSW90]. “Cognitive prediction” refers to prediction, classification, filtering, or state estimation 
under similar conditions. 

The chapter will begin with a condensed overview of key tools. These tools can be used separately, but 
they have been designed to work together, to allow an integrated solution to a very wide range of pos- 
sible tasks. Just as the brain itself has evolved to be able to “learn to do anything,” these tools are part of 
a unified approach to replicate that ability, and to help us understand the brain itself in more functional 
terms as a useful working system [PW09]. Many of the details and equations are available on the web, 
as you can see in the references. 

The chapter will then discuss the historical background and the larger directions of the field in more 
narrative terms. 


2.1 Listing of Key Types of Tools Available 


2.1.1 Backpropagation 


The original form of backpropagation [PW74,PW05] is a general closed-form method for calculat- 
ing the derivatives of some outcome of interest with respect to all of the inputs and parameters 
to any differentiable complex system. Thus if your system has N inputs, you get this information 
for a cost N times less than traditional differentiation, with an accuracy far greater than methods
like perturbing the inputs. Any real-time sensor fusion or control system which requires the
use of derivatives can be made much faster and more accurate by using this method. The larger
that N is, the more important it is to use backpropagation. It is easier to apply backpropagation to
standardized subroutines like artificial neural networks (ANN) or matrix multipliers than to custom
models, because standardized “dual” subroutines can be programmed to do the calculations.
Backpropagation works on input-output mappings, on dynamical systems, and on recurrent as well
as feedforward systems.

* This chapter does not represent the views of NSF; however, as work performed by a government employee on government
time, it may be copied freely subject to proper acknowledgment.

© 2011 by Taylor and Francis Group, LLC
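
The difference between backpropagation and perturbing the inputs, as described in this section, can be made concrete with a toy example. The function y = tanh(w·x)² and all names below are assumptions chosen for illustration, and the backward pass is derived by hand for this one function; it is not the general formulation of [PW74]. One backward sweep yields the derivatives with respect to all N inputs and all N weights, whereas the perturbation approach needs one extra forward evaluation per input.

```python
import math


def forward(x, w):
    """Compute y = tanh(sum_i w_i * x_i) ** 2 and cache intermediate values."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    z = math.tanh(s)
    return z * z, (s, z)


def backward(x, w, cache):
    """One backward sweep: derivatives of y with respect to every x_i and w_i."""
    _, z = cache
    dy_dz = 2.0 * z             # y = z ** 2
    dz_ds = 1.0 - z * z         # derivative of tanh
    dy_ds = dy_dz * dz_ds       # chain rule, computed once and reused
    grad_x = [dy_ds * wi for wi in w]
    grad_w = [dy_ds * xi for xi in x]
    return grad_x, grad_w


def perturb_gradient(x, w, eps=1e-6):
    """Approximate dy/dx_i by perturbing each input: N extra forward passes."""
    base, _ = forward(x, w)
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        grads.append((forward(bumped, w)[0] - base) / eps)
    return grads


x = [0.5, -1.0, 2.0]
w = [0.1, 0.2, -0.3]
y, cache = forward(x, w)
grad_x, grad_w = backward(x, w, cache)
print(grad_x)
print(perturb_gradient(x, w))
```

The two printed gradients agree only approximately, because the perturbation estimate carries a truncation error that depends on the step size, while the backward sweep is exact up to floating-point rounding.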


2.1.2 Efficient Universal Approximation of Nonlinear Functions 


Any general-purpose method for nonlinear control or prediction or pattern recognition must include 
some ability to approximate unknown nonlinear functions. Traditional engineers have often used 
Taylor series or look-up tables (e.g., “gain scheduling”) or radial basis functions for this purpose; how- 
ever, the number of weights or table entries increases exponentially as the number of input variables 
grows. Methods like that can do well if you have only one to three input variables, or if you have a lot 
of input variables whose actual values never leave a certain hyperplane, or a small set of cluster points. 
Beyond that, the growth in the number of parameters increases computational cost, and also increases 
error in estimating those parameters from data or experience. 

By contrast, Andrew Barron of Yale has proven [Barron93] that the well-known multilayer perceptron 
(MLP) neural network can maintain accuracy with more inputs, with complexity rising only as a poly- 
nomial function of the number of inputs, if the function to be approximated is smooth. For nonsmooth 
functions, the simultaneous recurrent network (SRN) offers a more universal Turing-like extension of the 
same capabilities [PW92a,CV09,PW92b]. (Note that the SRN is not at all the same as the “simple recurrent 
network” later discussed by some psychologists.) 

In actuality, even the MLP and SRN start to have difficulty when the number of true independent 
inputs grows larger than 50 or so, as in applications like full-scale streaming of raw video data, or 
assessment of the state of an entire power grid starting from raw data. In order to explain and repli- 
cate the ability of the mammal brain to perform such tasks, a more powerful but complex family of 
network designs has been proposed [PW98a], starting from the cellular SRN (CSRN) and the Object
Net [PW04,IKW08,PW09]. Reasonably fast learning has been demonstrated for CSRNs in performing 
computational tasks, like learning to navigate arbitrary mazes from sight and like evaluating the con- 
nectivity of an image [IKW08,YC99]. MLPs simply cannot perform these tasks. Simple feedforward 
implementations of Object Nets have generated improvements in Wide-Area Control for electric power 
[QVH07] and in playing chess. Using a feedforward Object Net as a “critic” or “position evaluator,” 
Fogel’s team [Fogel04] developed the world’s first computer system that learned to play master-class
chess without having been told specific rules of play by human experts.

All of these network designs can be trained to minimize square error in predicting a set of desired 
outputs Y(t) from a set of inputs X(t), over a database or stream of examples at different times t. 
Alternatively, they can be trained to minimize a different measure of error, such as square error plus 
a penalty function, or a logistic probability measure if the desired output is a binary variable. Efficient 
training generally requires a combination of backpropagation to get the gradient, plus any of a wide 
variety of common methods for using the gradient. It is also possible to do some scaling as part of the 
backwards propagation of information itself [PW74], but I am not aware of any work which has followed
up effectively on that possibility as yet. When the number of weights is small enough, training by
evolutionary computing methods like particle swarm optimization [dVMHH08,CV09] works well enough
and may be easier to implement with software available today.
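
As a minimal sketch of the training recipe just described, the following code fits a tiny one-hidden-layer network by minimizing mean square error, optionally plus a quadratic penalty, using backpropagation for the gradient and plain batch gradient descent to use it. The network size, learning rate, target function y = x^2, and all names are assumptions for illustration; practical work would normally use a more capable gradient-based optimizer.

```python
import math
import random


def predict(x, a, b, v, c):
    """out = c + sum_j v[j]*tanh(a[j]*x + b[j]); also returns the hidden activations."""
    hidden = [math.tanh(aj * x + bj) for aj, bj in zip(a, b)]
    return c + sum(vj * hj for vj, hj in zip(v, hidden)), hidden


def mean_square_error(samples, a, b, v, c):
    return sum((predict(x, a, b, v, c)[0] - y) ** 2 for x, y in samples) / len(samples)


def train(samples, n_hidden=3, rate=0.1, epochs=500, penalty=0.0, seed=0):
    """Batch gradient descent; returns the mean square error before and after."""
    rng = random.Random(seed)
    a = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    c = 0.0
    before = mean_square_error(samples, a, b, v, c)
    for _ in range(epochs):
        # Gradient of the penalty term penalty*sum(w**2); the output bias c is unpenalized.
        g_a = [2.0 * penalty * aj for aj in a]
        g_b = [2.0 * penalty * bj for bj in b]
        g_v = [2.0 * penalty * vj for vj in v]
        g_c = 0.0
        for x, y in samples:
            out, hidden = predict(x, a, b, v, c)
            f_out = 2.0 * (out - y) / len(samples)        # dE/d(out) for this sample
            g_c += f_out
            for j, hj in enumerate(hidden):
                g_v[j] += f_out * hj
                f_net = f_out * v[j] * (1.0 - hj * hj)    # propagated back through tanh
                g_a[j] += f_net * x
                g_b[j] += f_net
        a = [aj - rate * g for aj, g in zip(a, g_a)]
        b = [bj - rate * g for bj, g in zip(b, g_b)]
        v = [vj - rate * g for vj, g in zip(v, g_v)]
        c -= rate * g_c
    return before, mean_square_error(samples, a, b, v, c)


samples = [(i / 4.0 - 1.0, (i / 4.0 - 1.0) ** 2) for i in range(9)]   # y = x^2 on [-1, 1]
before, after = train(samples)
print("mean square error before and after training:", before, after)
```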

These designs can also be trained using some vector of (gradient) feedback, F_Y(t), to the output of the 
network, in situations where desired outputs are not known. This often happens in control applications. 
In situations where desired outputs and desired gradients are both known, the networks can be trained 
to minimize error in both. (See Gradient Assisted Learning [PW92b].) This can be the most efficient way 
to fit a neural network to approximate a large, expensive modeling code [PW05]. 
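
One plausible way to write a criterion that penalizes error in both the outputs and the gradients is sketched below. Here \hat{Y}(t) = f(X(t), W) denotes the network output, and the weighting factor \lambda is an assumption introduced for illustration; [PW92b] should be consulted for the formulation actually used in Gradient Assisted Learning.

```latex
E(W) = \sum_t \left\| Y(t) - \hat{Y}(t) \right\|^2
     + \lambda \sum_t \left\| \frac{\partial Y(t)}{\partial X(t)}
                            - \frac{\partial \hat{Y}(t)}{\partial X(t)} \right\|^2 ,
\qquad \hat{Y}(t) = f(X(t), W)
```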


2.1.3 More Powerful and General Decision and Control


The most powerful and general new methods are adaptive, approximate dynamic programming (ADP) 
and neural model-predictive control (NMPC). The theorems guaranteeing stability for these methods 
require much weaker conditions than the theorems for traditional or neural adaptive control; in practical 
terms, that means that they are much less likely to blow up if your plant does not meet your assumptions
exactly [BDJ08,HLADP,Suykens97,PW98b]. More important, they are optimizing methods, which
allow you to directly maximize whatever measure of performance you care about, deterministic or 
stochastic, whether it be profit (minus cost), or probability of survival in a challenging environment. 
In several very tough real-world applications, from low-cost manufacturing of carbon-carbon parts 
[WS90], to missile interception [HB98,HBO02,DBD06] to turbogenerator control [VHW03] to automo- 
tive engine control [SKJD09,Prokhorov08], they have demonstrated substantial improvements over the 
best previous systems, which were based on many years of expensive hand-crafted effort. Reinforcement 
learning methods in the ADP family have been used to train anthropomorphic robots to perform dex- 
terous tasks, like playing ice hockey or performing tennis shots, far beyond the capacity of traditional 
human-programmed robots [Schaal06]. 

NMPC is basically just the standard control method called model predictive control or receding hori- 
zon control, using neural networks to represent the model of the plant and/or the controller. In the 
earliest work on NMPC, we used the term “backpropagation through time (BTT) of utility” [MSW90]. 
NMPC is relatively easy to implement. It may be viewed as a simple upgrade of nonlinear adaptive con- 
trol, in which the backpropagation derivative calculations are extended over time in order to improve 
stability and performance across time. NMPC assumes that the model of the plant is correct and exact, 
but in many applications it turns out to be robust with respect to that assumption. The strong stability 
results [Suykens97] follow from known results in robust control for the stability of nonlinear MPC. 
The Prokhorov controller for the Prius hybrid car is based on NMPC. 
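
The idea of extending the derivative calculations over time can be sketched on a toy problem. Everything specific here is an assumption for illustration: the scalar linear model x(t+1) = a*x(t) + b*u(t) stands in for a neural plant model, the quadratic cost stands in for negative utility, and plain gradient descent on the action sequence stands in for training a controller network.

```python
def rollout(x0, actions, a=0.9, b=0.5):
    """Simulate the model x(t+1) = a*x(t) + b*u(t); returns [x(0), ..., x(T)]."""
    xs = [x0]
    for u in actions:
        xs.append(a * xs[-1] + b * u)
    return xs


def total_cost(xs, actions, r=0.1):
    """Cost summed over the horizon: squared state plus weighted control effort."""
    return sum(x * x for x in xs[1:]) + r * sum(u * u for u in actions)


def cost_gradient(xs, actions, a=0.9, b=0.5, r=0.1):
    """Backpropagation through time: one backward sweep gives d(cost)/du(t) for every t."""
    grad = [0.0] * len(actions)
    f_x = 0.0   # derivative of total cost w.r.t. x(t+1), including all later effects
    for t in reversed(range(len(actions))):
        f_x = 2.0 * xs[t + 1] + a * f_x
        grad[t] = 2.0 * r * actions[t] + b * f_x
    return grad


def plan(x0, horizon=10, step=0.02, iterations=500):
    """Gradient descent on the action sequence over the prediction horizon."""
    actions = [0.0] * horizon
    for _ in range(iterations):
        grad = cost_gradient(rollout(x0, actions), actions)
        actions = [u - step * g for u, g in zip(actions, grad)]
    return actions


# Receding horizon: re-plan at every step and apply only the first planned action.
x = 1.0
for _ in range(20):
    u0 = plan(x)[0]
    x = 0.9 * x + 0.5 * u0   # in this sketch the "real" plant equals the model
print("state after 20 controlled steps:", x)
```

The backward sweep costs about as much as one forward simulation regardless of the horizon length, which is what makes the approach practical for long horizons.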

In practical terms, many engineers believe that they need to use adaptive control or learning in order 
to cope with common changes in the world, such as changes in friction or mass in the engines or vehi- 
cles they are controlling. In actuality, such changes can be addressed much better and faster by insert- 
ing time-lagged recurrence into the controller (or into the model of the plant, if the controller gets to 
input the outputs of the recurrent neurons in the model). This makes it possible to “learn offline to be 
adaptive online” [PW99]. This is the basis for extensive successful work by Ford in “multistreaming” 
[Ford96,Ford97,Ford02]. The work by Ford in this area under Lee Feldkamp and Ken Marko was 
extremely diverse, as a simple web search will demonstrate. 

ADP is the more general and brain-like approach [HLADP]. It is easier to implement ADP when 
all the components are neural networks or linear systems, because of the need to use backpropaga- 
tion to calculate many derivatives or “sensitivity coefficients”; however, I have provided pseudocode 
for many ADP methods in an abstract way, which allows you to plug in any model of the plant or 
controller—a neural network, a fixed model, an elastic fuzzy logic module [PW93], or whatever you 
prefer [PW92c,PW05]. 

Workers in robust control have discovered that they cannot derive the most robust controller, in the 
general nonlinear case, without solving a “Hamilton-Jacobi-Bellman” equation. This cannot be done
exactly in the general case. ADP can be seen as a family of numerical methods, which provides the best 
available approximation to solving that problem. In pure robust control, the user trains the controller to 
minimize a cost function which represents the risk of instability and nothing else. But in practical situ- 
ations, the user can pick a cost function or utility function which is a sum of such instability terms plus 
the performance terms which he or she cares about. In communication applications, this may simply 
mean maximizing profit, with a “quality of service payment” term included, to account for the need to 
minimize downtime. 

Some ADP methods assume a model of the plant to be controlled (which may itself be a neural network 
trained concurrently); others do not. Those which do not may be compared to simple trial-and-error 
approaches, which become slower and slower as the number of variables increases. Researchers who
have only studied model-free reinforcement learning for discrete variables sometimes say that rein- 
forcement learning is too slow to give us anything like brain-like intelligence; however, model-based 
ADP designs for continuous variables have done much better on larger problems. Balakrishnan has 
reported that the best model-based methods are relatively insensitive to the accuracy of the model, even 
more so than NMPC is. 

One would expect the brain itself to use some kind of hybrid of model-free and model-based methods. 
It needs to use the understanding of cause-and-effect embedded in a model, but it also needs to be fairly 
robust with respect to the limits of that understanding. I am not aware of such optimal hybrids in the 
literature today. 


2.1.4 Time-Lagged Recurrent Networks 


Time-lagged recurrent networks (TLRNs) are useful for prediction, system identification, plant mod- 
eling, filtering, and state estimation. MLPs, SRNs, and other static neural networks provide a way to 
approximate any nonlinear function as Y = f(X, W), where W is a set of parameters or weights. A TLRN 
is any network of that kind, augmented by allowing the network to input the results of its own calcula- 
tions from previous time periods. 

There are many equivalent ways to represent this idea mathematically. Perhaps, the most useful is the oldest 
[PW87a,b]. In this representation, the TLRN is a combination of two functions, f and g, used to calculate: 


$$Y(t) = f(Y(t-1), R(t-1), X(t-1), W) \tag{2.1}$$


$$R(t) = g(Y(t-1), R(t-1), X(t-1), W) \tag{2.2}$$


People often say that the vector R is a collection of variables “inside the network,” which the network 
remembers from one time to the next, as these equations suggest. However, if this TLRN is trained to 
predict a motor to be controlled, then it may be important to send the vector R to the controller as well. 
If the network is well-trained, then the combination of Y and R together represents the state vector of the 
motor being observed. More precisely, in the stochastic case, it is a compact optimized representation 
of the “belief state” of what we know about the state of the motor. Access to the full belief state is often 
essential to good performance in real-world applications. Neural network control of any kind can usually 
be improved considerably by including it. 
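
Equations 2.1 and 2.2 can be sketched as a single update step applied repeatedly over a sequence. The single tanh layer used for both f and g, the dictionary W, and the particular weight values are assumptions for illustration only; any static network could play those roles. The point of the sketch is that the pair (Y, R) is carried forward and can be handed to a controller as a state estimate.

```python
import math


def layer(inputs, weights):
    """One tanh layer; `weights` is a list of rows, one row per output."""
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]


def tlrn_step(y_prev, r_prev, x_prev, W):
    """One time step of Equations 2.1 and 2.2."""
    z = y_prev + r_prev + x_prev + [1.0]   # concatenated inputs plus a bias term
    y = layer(z, W["f"])                   # Equation 2.1
    r = layer(z, W["g"])                   # Equation 2.2
    return y, r


def run_tlrn(xs, W, n_y, n_r):
    """Run over an input sequence; returns the (Y, R) pair produced at each step."""
    y, r = [0.0] * n_y, [0.0] * n_r
    states = []
    for x_prev in xs:
        y, r = tlrn_step(y, r, x_prev, W)
        states.append((y, r))
    return states


# One output, one recurrent variable, one input: each weight row has 1 + 1 + 1 + 1 entries.
W = {"f": [[0.5, 0.3, -0.2, 0.1]], "g": [[0.2, 0.6, 0.4, 0.0]]}
states_a = run_tlrn([[1.0], [0.0], [0.0]], W, n_y=1, n_r=1)
states_b = run_tlrn([[0.0], [0.0], [0.0]], W, n_y=1, n_r=1)
print(states_a[-1], states_b[-1])
```

The two runs differ only in their first input, yet their final outputs differ, which shows the memory carried by Y and R.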

Feldkamp and Prokhorov have done a three-way comparison of TLRNs, extended Kalman filters 
(EKF) and particle filters in estimating the true state vectors of a partially observed automotive system 
[Ford03]. They found that TLRNs performed about the same as particle filters, but far better than EKF. 
TLRNs were much less expensive to run than particle filters complex enough to match their perfor- 
mance. (The vector R represents the full belief state, because the full belief state is needed in order to 
minimize the error in the updates; the network is trained to minimize that error.) 

Ironically, the Ford group used EKF training to train their TLRN. In other words, they used back- 
propagation to calculate the derivatives of square error with respect to the weights, and then inserted 
those derivatives into a kind of EKF system to adapt the weights. This is also the only viable approach 
now available on conventional computers (other than brute force evolutionary computing) to train
cellular SRNs [IKW08].
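
A minimal sketch of EKF-style weight training follows. It is not the Ford group's algorithm: it uses the common textbook form in which the derivatives of the model output with respect to the weights (obtained here by a one-line backward pass) enter the Kalman gain, and the one-neuron model, the noise settings R and Q, and all names are assumptions for illustration.

```python
import math


def model(x, w):
    """Toy one-neuron model y = tanh(w[0]*x + w[1]) and its derivatives dy/dw."""
    y = math.tanh(w[0] * x + w[1])
    slope = 1.0 - y * y               # derivative propagated back through tanh
    return y, [slope * x, slope]


def ekf_step(w, P, x, target, R=1.0, Q=1e-4):
    """One EKF weight update; P is the weight error covariance (list of rows)."""
    n = len(w)
    y, H = model(x, w)
    PH = [sum(P[i][j] * H[j] for j in range(n)) for i in range(n)]
    S = sum(H[i] * PH[i] for i in range(n)) + R     # innovation variance
    K = [ph / S for ph in PH]                       # Kalman gain
    w = [w[i] + K[i] * (target - y) for i in range(n)]
    P = [[P[i][j] - K[i] * PH[j] + (Q if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return w, P


def mse(data, w):
    return sum((model(x, w)[0] - t) ** 2 for x, t in data) / len(data)


true_w = [0.8, -0.3]
data = [(i / 5.0 - 2.0, model(i / 5.0 - 2.0, true_w)[0]) for i in range(21)]
w, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
error_before = mse(data, w)
for _ in range(20):                   # several passes over the data
    for x, target in data:
        w, P = ekf_step(w, P, x, target)
error_after = mse(data, w)
print("mean square error before and after:", error_before, error_after)
```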

TLRNs have also been very successful, under a variety of names, in many other time-series predic- 
tion applications. Mo-Yuen Chow has reported excellent results in diagnostics of motors [MChow93] 
and of their components [MChow00]. Years ago, in a performance test funded by American Airlines, 
BehavHeuristics found that ordinary neural networks would sometimes do better than standard uni- 
variate time-series models like ARMA(p,q) [BJ70], but sometimes would do worse; however, TLRNs 
could do better consistently, because Equations 2.1 and 2.2 are a universal way to approximate what 
statisticians call multivariate NARMAX models, a more general and more powerful family of models.
Likewise, in the forecasting competition at the International Joint Conference on Neural Networks 2007
(IJCNN07), hard-working teams of statistician students performed much better than hard-working
teams of neural network students, but a researcher from Ford outperformed them all, with relatively 
little effort, by using their standard in-house package for training TLRNs. 

At IJCNN07, there was also a special meeting of the Alternative Energy Task Force of the IEEE
Computational Intelligence Society. At that meeting, engineers from the auto industry and electric power 
sector all agreed that the one thing they need most from universities is the training of students who are 
fully competent in the use of TLRNs. (ADP was the next most important.) 

For a student textbook building up to the use of TLRNs with accompanying software, see [PEL00]. 

In practical applications today, TLRNs are usually trained to minimize square error in prediction. 
However, in applications in the chemical industry, it has been found that “pure robust training” com- 
monly cuts prediction errors by a factor of three. More research is needed to develop an optimal hybrid 
between pure robust training and ordinary least squares [PW98b]. 

The TLRN and other neural networks provide a kind of global prediction model f. But in some pat- 
tern classification applications, it is often useful to make predictions based on what was seen in the clos- 
est past example; this is called precedent-based or memory-based forecasting. Most “kernel methods” 
in use today are a variation of that approach. For full brain-like performance in real-time learning, it 
is essential to combine memory-based capabilities with global generalization, and to use both in adapt- 
ing both; in other words “generalize but remember.” I have discussed this approach in general terms 
[PW92a]; however, in working implementations, the closest work done so far is the work by Atkeson on 
memory-based learning [AS95] and the part of the work by Principe which applies information theo- 
retic learning (related to kernel methods) to the residuals of a global model [PJX00,EP06]. Clustering 
and associative memory can play an important role in the memory of such hybrids. 
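
The “generalize but remember” idea can be illustrated with a deliberately crude sketch: a global model (here an ordinary least squares line) is corrected by the residual of the closest remembered example. This is not the method of [AS95] or [PJX00,EP06]; the one-dimensional data, the nearest-neighbor rule, and all names are assumptions for illustration.

```python
def fit_line(points):
    """Ordinary least squares fit y ~ slope*x + intercept (the global model)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def build_memory(points, line):
    """Store each past example with the residual the global model left on it."""
    slope, intercept = line
    return [(x, y - (slope * x + intercept)) for x, y in points]


def predict(x, line, memory):
    """Global prediction corrected by the residual of the closest remembered example."""
    slope, intercept = line
    _, residual = min(memory, key=lambda item: abs(item[0] - x))
    return slope * x + intercept + residual


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 9.5), (4.0, 9.0)]   # bump at x = 3
line = fit_line(points)
memory = build_memory(points, line)
print(predict(3.0, line, memory), predict(2.9, line, memory))
```

At a remembered input the corrected prediction reproduces the stored output, while away from stored examples the global line still supplies the trend.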


2.1.5 Massively Parallel Chips Like Cellular Neural Network Chips 


When NSF set up a research program in neuroengineering in 1988, we defined an ANN as any general- 
purpose design (algorithm/architecture) which can take full advantage of massively parallel computing 
hardware. We reached out to researchers from all branches of engineering and computer science willing 
to face up squarely to this challenge. 

As a result, all of these tools were designed to be compatible with a new generation of computer chips, so
that they can provide real-time applications much faster and cheaper than traditional algorithms of the same 
level of complexity. (Neural network approximation also allows models and controllers of reduced com- 
plexity.) For example, a group at Oak Ridge learned about “backpropagation” and renamed it the “second 
adjoint method” [PW05]. Engineers like Robert Newcomb then built some chips, which included “adjoint 
circuits” to calculate derivatives through local calculations on-board a specialty chip. Chua’s group [YC99] 
has shown in detail how the calculations of backpropagation through time map into a kind of cellular 
neural network (CNN) chip, allowing thousands of times acceleration in performing the same calculation. 

From 1988 to about 2000, there were few practical applications which took real advantage of this 
capability. At one time, the Jet Propulsion Laboratory announced a major agreement between Mosaix 
LLC and Ford to use a new neural network chip, suitable for implementing Ford’s large TLRN diag- 
nostic and control systems; however, as the standard processors on-board cars grew faster and more 
powerful, there was less and less need to add anything extra. Throughout this period, Moore’s law made 
it hard to justify the use of new chips. 

In recent years, the situation has changed. The speed of processor chips has stalled, at least for now. New 
progress now mainly depends on being able to use more and more processors on a single chip, and on getting 
more and more general functionality out of systems with more and more processors. That is exactly the chal- 
lenge which ANN research has focused on for decades now. Any engineering task which can be formulated 
as a task in prediction or control can now take full advantage of these new chips, by use of ANN designs. 
CNNs now provide the best practical access to these kinds of capabilities. CNNs have been produced
with thousands of processors per chip. With new memristor technology and new ideas in nanoelectron- 
ics and nanophotonics, it now seems certain that we can raise this to millions of processors per chip or 
more. A new center (CLION) was created in 2009 at the FedEx Institute of Technology, in Memphis, 
under Robert Kozma and myself, which plans to streamline and improve the pipeline from tasks in 
optimization and prediction to CNN implementations based on ANN tools. 


2.2 Historical Background and Larger Context 


The neural network field has many important historical roots going back to people like Von Neumann, 
Hebb, Grossberg, Widrow, and many others. This section will focus on those aspects most important to 
the engineer interested in applying such tools today. 

For many decades, neural network researchers have worked to “build a brain,” as the Riken Institute 
of Japan has put it. How can we build integrated intelligent systems, which capture the brain’s ability to 
learn to “do anything,” through some kind of universal learning ability? 

Figure 2.1 reminds us of some important realities that specialized researchers often forget, as they 
“miss the forest for the trees.” 

The brain, as a whole system, is an information-processing system. As an information-processing 
system, its entire function as a whole system is to calculate its outputs. Its outputs are actions—actions 
like moving muscles or glandular secretions. (Biologists sometimes call this “squeezing or squirting.”) 
The brain has many important subsystems to perform tasks like pattern recognition, prediction and 
memory, among others; however, these are all internal subsystems, which can be fully understood based 
on what they contribute to the overall function of the entire system. Leaving aside the more specialized 
preprocessors and the sources of primary reinforcement, the larger challenge we face is very focused: 
how can we build a general-purpose intelligent controller, which has all the flexibility and learning abili- 
ties of this one, based on parallel distributed hardware? That includes the development of the required 
subsystems—but they are just part of the larger challenge here. 

In the 1960s, researchers like Marvin Minsky proposed that we could build a universal intelligent 
controller by developing general-purpose reinforcement learning systems (RLS), as illustrated in 
Figure 2.2. 

We may think of an RLS as a kind of black box controller. You hook it up to all the available sensors 
(X) and actuators (u), and you also hook it up to some kind of performance monitoring system which 
gives real-time feedback (U(t)) on how well it is doing. The system then learns to maximize performance 
over time. In order to get the results you really want from this system, you have to decide on what you 
really want the system to accomplish; that means that you must translate your goals into a kind of metric
of performance or “cardinal utility function” U [JVN53]. Experts in business decision making have
developed very extensive guidelines and training to help users to translate what they want into a utility 
function; see [Raiffa68] for an introduction to that large literature. 


FIGURE 2.1 Brain as a whole system is an intelligent controller. (Adapted from NIH.) [Diagram labels: Reinforcement; Sensory input; Action.]


FIGURE 2.2 Reinforcement learning systems. [Diagram labels: “Utility” or “reward” or “reinforcement”; Sensor inputs; Actions.]


FIGURE 2.3 First general-purpose ADP system. (From 1971-72 Harvard thesis proposal.) [Diagram label: R(t+1).]


The earlier work on RLS was a great disappointment to researchers like Minsky. Trial-and-error 
methods developed on the basis of intuition were unable to manage even a few input variables well in 
simulation. In order to solve this problem, I went back to mathematical foundations, and developed the 
first reinforcement learning system based on ADP, illustrated in Figure 2.3.

The idea was to train all three component networks in parallel, based on backpropagation feedback. 
There were actually three different streams of derivatives being computed here—derivatives of J(t+1) 
with respect to weights in the Action network or controller; derivatives of prediction error, in the Model 
network; and derivatives of a measure of error in satisfying the Bellman equation of dynamic program- 
ming, to train the critic. The dashed lines here show the flow of backpropagation used to train the Action 
network. Equations and pseudocode for the entire design, and more sophisticated relatives, may be 
found in [PW92a,b,c]. More recent work in these directions is reviewed in [HLADP], and in many recent 
papers in neural network conferences. 
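
For orientation, the Bellman equation referred to above can be written in one common form as follows. The notation is an assumption made here for illustration: angle brackets denote expectation, r is a discount (interest) rate, and \hat{J} is the critic network's approximation of J. The exact formulations used in the design are those given in [PW92a,b,c].

```latex
J(R(t)) = \max_{u(t)} \left[ U(R(t), u(t)) + \frac{\langle J(R(t+1)) \rangle}{1 + r} \right]
```

Under the same assumptions, one simple measure of the critic's error in satisfying this equation along an observed trajectory is:

```latex
E(t) = \left( \hat{J}(R(t)) - U(t) - \frac{\hat{J}(R(t+1))}{1 + r} \right)^2
```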

There is a strong overlap between reinforcement learning and ADP, but they are not the same. ADP 
does not include reinforcement learning methods which fail to approximate Bellman’s equation or some 
other condition for optimal decision making across time, with foresight, allowing for the possibility of 
random disturbance. ADP assumes that we (may) know the utility function U(X) itself (or even a recur- 
rent utility function), instead of just a current reward signal; with systems like the brain, performance 
is improved enormously by exploiting our knowledge that U is based on a variety of variables, which we 
can learn about directly. 

All of the ADP designs in [PW92c] are examples of what I now call vector intelligence. I call them
“vector intelligence” because the input vector X, the action vector u and the recurrent state information
R are all treated like vectors. They are treated as collections of independent variables. Also, the upper
part of the brain was assumed to be designed around a fixed common sampling time, about 100-200 ms.
There was good reason to hope that the complexities of higher intelligence could be the emergent, learned
result of such a simple underlying learning system [PW87a,PW09]. All of these systems are truly
intelligent systems, in that they should always converge to the optimal strategy of behavior, given enough
learning time and enough computing capacity. But how quickly? It is possible to add new features, still
consistent with general-purpose intelligence, which make it easier for the system to learn to cope with
spatial complexity and complexity across time (multiple time intervals), and to escape local minima. This
leads to a new view illustrated in Figure 2.4.


FIGURE 2.4 Levels of intelligence from vector to mammal. [Diagram labels: From vector to mammal; 0. Vector intelligence; R(t+1); X(t); Add creativity; ability to make decisions; learns master class chess (Fogel, Proc. IEEE 2004).]

In 1998 [PW98], I developed mathematical approaches to move us forward all the way from vector 
intelligence to mammal-level intelligence. However, as a practical matter in engineering research, we 
will probably have to master the first of these steps much more completely before we are ready to make 
more serious progress in the next two steps. 

See [PW09] for more details on this larger picture, and for thoughts about levels of intelligence beyond 
the basic mammal brain. 


3.1 Background of Neurocontrol


The mysterious nature of the human brain, with its billions of neurons and enormously complicated
biological structure, has motivated many disciplines that seem radically different from each other
at first glance, for example, engineering and medicine. Despite this difference, the engineering and
medical sciences share a common base when it comes to brain research. From a microscopic point of
view, determining the building blocks and the functionality of those elements is one critical issue;
from a macroscopic viewpoint, discovering the functionality of groups of such elements is another.
Research at both scales has resulted in many useful models and algorithms, which are used frequently
today. The framework of artificial neural networks has established an elegant bridge between problems
that display uncertainty, imprecision, noise, and modeling mismatches, and solutions that require
precision, robustness, adaptability, and data-centeredness. Control engineering, combined with the
tools offered by artificial neural networks, has produced a synergy whose outcomes are spread over an
enormously wide range.

The history of neural networks research dates back to 1943, when Warren McCulloch and Walter Pitts
postulated the first neuron model, which is assumed to fire under certain circumstances [MP43].
Philosophically, the analytical models used today are variants of this first model. The book entitled
The Organization of Behaviour by Donald Hebb in 1949 was another milestone, mentioning synaptic
modification for the first time [H49]. In 1956, Albert Uttley reported the classification of simple sets
containing binary patterns [U56], while in 1958 the community was introduced to the perceptron by
Frank Rosenblatt. In 1962, Rosenblatt postulated several learning algorithms for the perceptron model
capable of distinguishing binary classes [R59], and another milestone came in 1960: least mean squares
for the adaptive linear element (ADALINE) by Widrow and Hoff [WH60]. Many works were reported after
this, and in 1982, John J. Hopfield proposed a neural model that is capable of storing limited
information and retrieving it correctly from a partially correct initial state [H82]. The next
breakthrough, which led to the resurgence of neural networks research, was the discovery of the error
backpropagation technique [RHW86]. Although gradient descent was a known technique of numerical
analysis, its application to feedforward neural networks was formulated by David E. Rumelhart,
Geoffrey E. Hinton, and Ronald J. Williams in 1986. A radically different viewpoint on the activation
scheme, the radial basis function, was proposed by D.S. Broomhead and D. Lowe in 1988. This approach
opened a new horizon, particularly in applications requiring clustering of raw data. As the models and
alternatives multiplied, it became important to prove the universal approximation properties
associated with each model. Three works published in 1989, by Ken-Ichi Funahashi, Kurt Hornik, and
George Cybenko, proved that multilayer feedforward networks are universal approximators, performing
superpositions of sigmoidal functions to approximate a given map with finite precision
[HSV89,F89,C89].

The history of neural network-based control covers mainly the research reported since the discovery
of error backpropagation in 1986. Paul J. Werbos reported the use of neural networks with
backpropagation utility in dynamic system inversion, and these became the first results to draw the
interest of the control community to neural network-based applications. The work of Kawato et al. and
the book by Antsaklis et al. accelerated work in the area, as they describe the building blocks of neural
network-based control [KFS87,APW91]. The pioneering work of Narendra and Parthasarathy has been
an inspiration for many researchers studying neural network-based control, or neurocontrol [NP91].
Their four system types, their clear definition of the role of the neural network in a feedback control
system, and their examples have been used as benchmarks by many researchers claiming novel methods.
From 1990 to date, a significant increase in the number of neural network papers has been observed.
Table 3.1 lists the number of published items containing the words neural and control according to the
Science Direct and IEEE databases; the growing interest in neural control can be seen clearly.

TABLE 3.1 Number of Papers Containing the Keywords Neural and Control between 1990 and 2008

Year    Items in Science Direct    Items in IEEE
2008    988                        1355
2007    820                        1154
2006    732                        1254
2005    763                        843
2004    644                        869
2003    685                        737
2002    559                        820
2001    565                        670
2000    564                        703
1999    456                        749
1998    491                        667
1997    516                        737
1996    489                        722
1995    380                        723
1994    326                        731
1993    290                        639
1992    234                        409
1991    187                        408
1990    176                        270

In [PF94], a decoupled extended Kalman filter algorithm was implemented for the training of recurrent
neural networks. The proposed scheme was validated on a cart-pole system, a bioreactor control
problem, and the idle speed control of an engine. Polycarpou reports a stable, adaptive
neurocontrol scheme for a class of nonlinear systems and demonstrates its stability using Lyapunov
theorems [M96]. In 1996, Narendra considered neural network-based control of systems having different
types of uncertainties, for example, where the mathematical details embodying the plant dynamics are
not known, or where their structures are known but the parameters are unavailable [N96]. Robotics has
been a major implementation area for neural network controllers. In [LJY97], rigid manipulator
dynamics are studied with an augmented tuning law to ensure stability and tracking performance.
Removal of the certainty equivalence and persistent excitation conditions are important contributions
of the cited work. Wai uses the neural controller as an auxiliary tool for improving the tracking
performance of a two-axis robot subject to gravitational effects [W03]. Another field of research
benefiting from the possibilities offered by the neural networks framework is chemical process
engineering. In [H99,EAK99], a categorization of schemes under the titles predictive control, inverse
model-based control, and adaptive control is presented. Calise et al. present an adaptive output
feedback control approach utilizing a neural network [CHI01], while [WHO01] focuses on enhancing the
qualities of output regulation in nonlinear systems. Padhi et al. make use of neural network-based
control in distributed parameter systems [PBR01]. The use of neural network-based adaptive controllers
in which the role of the neural network is to provide nonlinear functions is a common approach reported
several times in the literature [AB01,GY01,GW02,GW04,HGL05,PKM09]. Selmic and Lewis report the
compensation of backlash with neural network-based dynamic inversion exploiting Hebbian tuning
[SL01]; Li presents a radial
basis function neural network-based controller acting on a fighter aircraft [LSS01]. Chen and Narendra
present a comparative study demonstrating the usefulness of a neural controller assisted by a linear
control term [CN01]. A switching logic is designed, and it is shown that the neural network-based
control scheme outperforms the pure linear and pure nonlinear versions of the feedback control law.
Another work considering multiple models activated via switching relaxes the condition of global
boundedness of high-order nonlinear terms and improves the neurocontrol approach of Chen and Narendra
[FC07]. An application of differential neural network-based control to nonlinear stochastic systems is
discussed by Poznyak and Ljung [PL01]; nonlinear output regulation via recurrent neural networks is
elaborated in [ZW01], and estimation of a Lyapunov function is performed for stable adaptive
neurocontrol in [R01]. Predictive control has been another field of research that has used the tools of
the neural networks framework. In [WW01], a particular neural structure is proposed, and this scheme is
used for model predictive control. Yu and Gomm consider the multivariable model predictive control
strategy on a chemical reactor model [YG03]. Applications of neurocontrol techniques to discrete-event
automata are presented in [PS01], to reinforcement learning in [SW01], and to large-scale traffic
network management in [CSC06]. Unmanned systems are another field that benefits from neural
network-based approaches. Due to the varying operating conditions and the difficulty of constructing
the necessary process models, numerical data-based approaches become preferable, as in the study
reported in [K06], where a neural controller assists a proportional plus derivative controller and
handles time-dependent variations, as its structure enables adaptation. This ensures that a certain
set of performance criteria are met simultaneously. In [HCL07], a learning algorithm of
proportional-integral-derivative type is proposed, and a tracking control example is given for a
second-order chaotic system. Prokhorov suggests methods to train recurrent neural models for real-time
applications [P07], and Papadimitropoulos et al. implement a fault-detection scheme based on an online
approximation via neural networks [PRP07]. Two of the successful real-time results on spark ignition
engines are reported in [CCBC07,VSKJ07], where neural network models are used as internal model
controller in
[CCBC07] and as observer and controller in [VSKJ07]. In [YQLRO7], tabu search is adapted for neural net- 
work training to overcome the problem of convergence to local minimum. Alanis et al. [ASLO7] propose 
high-order neural networks used with backstepping control technique. Discrete time output regulation 
using neural networks is considered in [LH07], and use of neurocontrollers in robust Markov games is 
studied by [SG07]. Recently, radial basis function neural networks were employed for the control of a non- 
holonomic mobile robot in [BFC09] and fine-tuning issues in large-scale systems have been addressed in 
[KK09]. Another recent work by Ferrari reports an application of dynamic neural network—forced implicit 
model following on a tailfin-controlled missile problem [F09]. Direct adaptive optimal control via neural 
network tools is considered in [VL09], model predictive control for a steel pickling process is studied in 
[KTHD09], and issues in the sampled data adaptive control are elaborated in [P09]. 

In brief, in the 1980s, the first steps and models were introduced, while the 1990s were the years of 
establishing the full diversity of learning schemes, architectures, and variants. According to the cited 
volume of research, the outcomes of this century are more focused on applications and integration with 
other modules of feedback systems. The approaches seem to incorporate the full power of computing 
facilities as well as small-size and versatile data acquisition hardware. Since the introduction of the 
McCulloch-Pitts neuron model, it can be claimed that, parallel to the technological innovations in 
computing hardware, progress in neural network-based control systems has spread over wider fields of 
application. In what follows, the learning algorithms and architectural possibilities are discussed. 
The methods of neural network-based control are presented and an application example is given. 


3.2 Learning Algorithms 


A critically important component of neural network research is the way in which the parameters are 
tuned to meet a predefined set of performance criteria. Despite the presence of a number of learning 
algorithms, two of them have become standard methods in engineering applications, and are elaborated 
in the sequel. 


3.2.1 Error Backpropagation 


Consider the feedforward neural network structure shown in Figure 3.1. The structure is called 
feedforward as the flow of information has a one-way nature. In order to describe the parameter 
modification rule, a sub- and superscripting convention for the variables needs to be adopted. In 
Figure 3.2, the layers in a given network are labeled, and the layer number is contained as a 
superscript. The synaptic weight between node $i$ in layer $k+1$ and node $j$ in layer $k$ is denoted 
by $w_{ij}^{k}$. Let an input vector and an output vector at time $t_0$ be defined as 
$I_0 = (u_1(t_0)\ u_2(t_0) \ldots u_m(t_0))$ and $O_0 = (y_1(t_0)\ y_2(t_0) \ldots y_n(t_0))$, respectively. 
An input-output pair, shortly a pair or sample, is defined as $S_0 = \{I_0, O_0\}$. Consider there are 
$P$ pairs in a given data set, which we call the training set. When the training set is given, one must 
interpret it as follows: When $I_0$ is presented to a neural network of appropriate dimensions, its 
response must be an approximation of $O_0$. 

FIGURE 3.1 Structure of a feedforward neural network. 

FIGURE 3.2 Sub- and superscripting of the variables in a feedforward neural network. 

Based on this, the response of a neural network to a set of input vectors can be evaluated and the 
total cost over the given set can be defined as follows: 


$$J(w) = \frac{1}{P} \sum_{p=1}^{P} \sum_{i=1}^{n} \left( d_i^p - y_i^p(I_p, w) \right)^2 \qquad (3.1)$$ 


where $d_i^p$ is the target output corresponding to the $i$th output of the neural network that responds 
to the $p$th pattern. The cost in (3.1) is also called the mean squared error (MSE) measure. Similarly, 
$y_i^p$ is the response of the neural network to $I_p$. In (3.1), the generic symbol $w$ stands for the 
set of all adjustable parameters, that is, the synaptic strengths, or weights. 

Let the output of a neuron in the network be denoted by $o_i^{k+1}$. This neuron can belong to the 
output layer or a hidden layer of the network. The dependence of $o_i^{k+1}$ on the adjustable 
parameters is as given below. Define $S_i^{k+1} := \sum_{j=1}^{n_k} w_{ij}^{k} o_j^{k}$ as the net sum 
determining the activation level of the neuron: 


$$o_i^{k+1} = f\left(S_i^{k+1}\right) \qquad (3.2)$$ 


where 
$f(\cdot)$ is the neuronal activation function 
$n_k$ is the number of neurons in the $k$th layer 


Gradient descent prescribes the following parameter update rule: 


$$w_{ij}^{k}(t+1) = w_{ij}^{k}(t) - \eta \frac{\partial J(w)}{\partial w_{ij}^{k}(t)} \qquad (3.3)$$ 

where 
$\eta$ is the learning rate chosen to satisfy $0 < \eta < 1$ 
the index $t$ emphasizes the iterative nature of the scheme 

Defining $\Delta w_{ij}^{k}(t) = w_{ij}^{k}(t+1) - w_{ij}^{k}(t)$, one could reformulate the above law as 

$$\Delta w_{ij}^{k}(t) = -\eta \frac{\partial J(w)}{\partial w_{ij}^{k}(t)} \qquad (3.4)$$ 


which is known also as the MIT rule, steepest descent, or the gradient descent, all referring to the above 
modification scheme. 


3.2.1.1 Adaptation Law for the Output Layer Weights 


Let the $(k+1)$th layer be the output layer. This means $y_i = o_i^{k+1}$, and the target value for this 
output is specified explicitly by $d_i$. For the $p$th pair, define the output error at the $i$th output as 


$$e_i^p = d_i^p - y_i^p(I_p, w) \qquad (3.5)$$ 


After evaluating the partial derivative in (3.4), the updating of the output layer weights is performed 
according to the formula given in (3.6), where the pattern index is dropped for simplicity. 


$$\Delta w_{ij}^{k} = \eta\, \delta_i^{k+1} o_j^{k} \qquad (3.6)$$ 

where $\delta_i^{k+1}$ for the output layer is defined as follows: 


$$\delta_i^{k+1} = e_i f'\left(S_i^{k+1}\right) \qquad (3.7)$$ 


where $f'(S_i^{k+1}) = \partial f / \partial S_i^{k+1}$. One could check the above rule from Figure 3.2 to 
see the dependencies among the variables involved. 


3.2.1.2 Adaptation Law for the Hidden Layer Weights 


The derivation for the hidden layer weights is not as straightforward as that for the output layer and 
needs an extra step, as there are many paths through which the output errors can be backpropagated. In 
other words, according to Figures 3.1 and 3.2, it is clear that a small perturbation in $w_{ij}^{k}$ will 
change the entire set of network outputs, causing a change in the value of $J$. Now we consider the 
$(k+1)$th layer as the hidden layer (see Figure 3.3). The general expression given by (3.6) is still valid 
and, after appropriate manipulations, we have 


$$\delta_i^{k+1} = \left( \sum_{h=1}^{n_{k+2}} \delta_h^{k+2} w_{hi}^{k+1} \right) f'\left(S_i^{k+1}\right) \qquad (3.8)$$ 


It is straightforward to see that the approach presented is still applicable if there is more than one 
hidden layer. In such cases, output errors are backpropagated until the necessary values shown in (3.8) 
are obtained. 
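
The output and hidden layer laws in (3.6) through (3.8) can be sketched for a network with one hidden layer as follows. This is a minimal illustration under stated assumptions rather than the chapter's own implementation: the hidden layer uses tanh, the output neurons are linear, the cost for a single pair is taken as half the squared error so that constant factors are absorbed into the learning rate, and all names (`forward`, `backprop_updates`, `W1`, `W2`) are hypothetical. The finite-difference routine is included only to check that the delta-based update coincides with the negative gradient direction.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass of a network with one tanh hidden layer and a linear output."""
    s1 = W1 @ x            # net sums of the hidden layer, as in (3.2)
    o1 = np.tanh(s1)       # hidden layer outputs
    y = W2 @ o1            # linear output neurons, so f'(S) = 1 there
    return o1, y

def backprop_updates(x, d, W1, W2, eta):
    """Return the weight changes prescribed by (3.6)-(3.8) for one pair."""
    o1, y = forward(x, W1, W2)
    e = d - y                                   # output error, (3.5)
    delta2 = e                                  # (3.7) with f'(S) = 1
    delta1 = (W2.T @ delta2) * (1.0 - o1 ** 2)  # (3.8), tanh'(S) = 1 - tanh(S)^2
    dW2 = eta * np.outer(delta2, o1)            # (3.6) for the output layer
    dW1 = eta * np.outer(delta1, x)             # (3.6) for the hidden layer
    return dW1, dW2

def cost(x, d, W1, W2):
    """Half of the squared error for one pair (assumed scaling of the cost)."""
    _, y = forward(x, W1, W2)
    return 0.5 * np.sum((d - y) ** 2)

def numerical_descent_step(x, d, W1, W2, eta, h=1e-6):
    """Finite-difference estimate of -eta * dJ/dW1, used only as a check."""
    step = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += h
        Wm[idx] -= h
        step[idx] = -eta * (cost(x, d, Wp, W2) - cost(x, d, Wm, W2)) / (2 * h)
    return step

rng = np.random.default_rng(0)
x, d = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
dW1, dW2 = backprop_updates(x, d, W1, W2, eta=0.1)
max_gap = np.max(np.abs(dW1 - numerical_descent_step(x, d, W1, W2, eta=0.1)))
```

The quantity `max_gap` should be close to zero, which indicates that the hidden layer deltas obtained by backpropagating the output errors reproduce the gradient descent step in (3.4).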


FIGURE 3.3 Influence of the output errors on a specific hidden layer weight. 

A few modifications to the update law in (3.6) have been proposed to speed up the convergence. The error 
backpropagation scheme starts fast, yet as time passes, it gradually slows down, particularly in the 
regions where the gradient is small in magnitude. Adding a momentum term or modifying the learning rate 
are the standard approaches whose improving effect has been shown empirically [H94]. 
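
One common form of the momentum modification, written in the notation of (3.6), is sketched below. The momentum coefficient $\alpha$ is a symbol introduced here for illustration and does not appear in the original text; the previous weight change is reused so that the update keeps moving through regions where the gradient is small.

```latex
\Delta w_{ij}^{k}(t) = \eta\, \delta_i^{k+1} o_j^{k} + \alpha\, \Delta w_{ij}^{k}(t-1),
\qquad 0 \le \alpha < 1
```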


3.2.2 Second-Order Methods 


Error backpropagation has been used successfully in many applications; however, its very slow 
convergence speed is the major drawback highlighted many times. Increasing the convergence speed has 
therefore been a major source of motivation, and it was achieved for feedforward neural networks in 1994 
[HM94]. Hagan and Menhaj applied the Marquardt algorithm to update the synaptic weights of a feedforward 
neural network structure. The essence of the algorithm is as follows. 

Consider a neural network having $n$ outputs and $N$ adjustable parameters denoted by the vector 
$w = (\omega_1\ \omega_2 \ldots \omega_N)$. Each entry of the parameter vector corresponds to a unique 
parameter in an ordered fashion. If there are $P$ pairs over which the interpolation is to be performed, 
the cost function qualifying the performance is as given in (3.1) and the update law is given in (3.9). 


$$w_{t+1} = w_t - \left( \nabla_w^2 J(w_t) \right)^{-1} \nabla_w J(w_t) \qquad (3.9)$$ 


where $t$ stands for the discrete time index. Here, $\nabla_w^2 J(w_t) = 2H(w_t)^T H(w_t) + g(H(w_t))$ 
with $g(H(w_t))$ being a small residual, and $\nabla_w J(w_t) = 2H(w_t)^T E(w_t)$ with $E$ and $H$ being 
the error vector and the Jacobian, as given in (3.10) and (3.11), respectively. The error vector contains 
the errors computed by (3.5) for every training pair, and the Jacobian contains the partial derivatives 
of each component of $E$ with respect to every parameter in $w$. 


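$$E = \left( e_1^1 \ \ldots \ e_n^1 \ \ e_1^2 \ \ldots \ e_n^2 \ \ \ldots \ \ e_1^P \ \ldots \ e_n^P \right)^T \qquad (3.10)$$ 

$$H(w) = \begin{pmatrix} 
\frac{\partial e_1^1(w)}{\partial \omega_1} & \frac{\partial e_1^1(w)}{\partial \omega_2} & \cdots & \frac{\partial e_1^1(w)}{\partial \omega_N} \\ 
\frac{\partial e_2^1(w)}{\partial \omega_1} & \frac{\partial e_2^1(w)}{\partial \omega_2} & \cdots & \frac{\partial e_2^1(w)}{\partial \omega_N} \\ 
\vdots & \vdots & & \vdots \\ 
\frac{\partial e_n^1(w)}{\partial \omega_1} & \frac{\partial e_n^1(w)}{\partial \omega_2} & \cdots & \frac{\partial e_n^1(w)}{\partial \omega_N} \\ 
\vdots & \vdots & & \vdots \\ 
\frac{\partial e_1^P(w)}{\partial \omega_1} & \frac{\partial e_1^P(w)}{\partial \omega_2} & \cdots & \frac{\partial e_1^P(w)}{\partial \omega_N} \\ 
\frac{\partial e_2^P(w)}{\partial \omega_1} & \frac{\partial e_2^P(w)}{\partial \omega_2} & \cdots & \frac{\partial e_2^P(w)}{\partial \omega_N} \\ 
\vdots & \vdots & & \vdots \\ 
\frac{\partial e_n^P(w)}{\partial \omega_1} & \frac{\partial e_n^P(w)}{\partial \omega_2} & \cdots & \frac{\partial e_n^P(w)}{\partial \omega_N} 
\end{pmatrix} \qquad (3.11)$$ 


Based on these definitions, the well-known Gauss-Newton algorithm can be given as 


$$w_{t+1} = w_t - \left( H(w_t)^T H(w_t) \right)^{-1} H(w_t)^T E(w_t) \qquad (3.12)$$ 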


and the Levenberg-Marquardt update can be constructed as 

$$w_{t+1} = w_t - \left( \mu I + H(w_t)^T H(w_t) \right)^{-1} H(w_t)^T E(w_t) \qquad (3.13)$$ 


where 
$\mu > 0$ is a user-defined scalar design parameter for improving the rank deficiency problem of the 
matrix $H(w_t)^T H(w_t)$ 
$I$ is an identity matrix of dimensions $N \times N$ 


It is important to note that for small $\mu$, (3.13) approximates the standard Gauss-Newton method (see 
(3.12)), and for large $\mu$, the tuning law becomes the standard error backpropagation algorithm with a 
step size $\eta \approx 1/\mu$. Therefore, the Levenberg-Marquardt method establishes a good balance 
between error backpropagation and Gauss-Newton strategies and inherits the prominent features of both 
algorithms in eliminating the rank deficiency problem with improved convergence. Despite such remarkably 
good properties, the algorithm in (3.13) requires building the $nP \times N$ Jacobian and inverting a 
matrix of dimensions $N \times N$ at every iteration, indicating high computational intensity. Other 
variants of second-order methods differ in some nuances, yet in essence they implement the same 
philosophy. Conjugate gradient methods with the Polak-Ribiere and Fletcher-Reeves formulas are some 
variants used in the literature [H94]. Nevertheless, the problem of getting trapped in local minima 
continues to exist, and the design of novel adaptation laws is an active 
research topic within the realm of neural networks. 
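
A single Levenberg-Marquardt update of the form (3.13) can be sketched as below. The toy fitting problem, the function names (`lm_step`, `residual_fn`, `jacobian_fn`), and the chosen values of `mu` are assumptions made for illustration only. The two calls are meant to show the limiting behaviors described above: a very small `mu` gives nearly a Gauss-Newton step, and a very large `mu` gives nearly a gradient step of size `1/mu`.

```python
import numpy as np

def lm_step(w, residual_fn, jacobian_fn, mu):
    """One Levenberg-Marquardt update following the form of (3.13)."""
    E = residual_fn(w)                       # error vector, one entry per output and pair
    H = jacobian_fn(w)                       # Jacobian dE/dw, one column per parameter
    A = mu * np.eye(w.size) + H.T @ H        # N x N matrix, invertible for mu > 0
    return w - np.linalg.solve(A, H.T @ E)   # solve instead of forming the inverse

# Hypothetical toy problem: fit y = a*u + b, so the errors are linear in w = (a, b).
u = np.linspace(-1.0, 1.0, 20)
d = 3.0 * u + 0.5

def residual_fn(w):
    return d - (w[0] * u + w[1])             # e = d - y, as in (3.5)

def jacobian_fn(w):
    return -np.column_stack([u, np.ones_like(u)])

w0 = np.zeros(2)
w_small_mu = lm_step(w0, residual_fn, jacobian_fn, mu=1e-9)   # close to Gauss-Newton
w_large_mu = lm_step(w0, residual_fn, jacobian_fn, mu=1e6)    # close to a small gradient step
gradient_step = w0 - (1.0 / 1e6) * jacobian_fn(w0).T @ residual_fn(w0)
```

Because the errors in this toy problem are linear in the parameters, the small-`mu` step lands essentially on the least-squares solution in one iteration. For a neural network the errors are nonlinear in `w`, so the step is repeated and `mu` is typically adapted between iterations.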


3.2.3 Other Alternatives 


Typical complaints in the application domain have been very slow convergence, convergence to suboptimal 
solutions rendering the network incapable of performing the desired task, oversensitivity to suddenly 
changing inputs due to the gradient computation, and the like. The persisting nature of such 
difficulties has led researchers to develop alternative tuning laws alleviating some of these drawbacks 
or introducing some positive qualities. Methods inspired by variable structure control are one such 
class, showing that the error can be cast into a phase space and guided toward the origin while 
displaying the robustness properties of the underlying technique [YEK02,PMB98]. Another class is based 
on derivative-free adaptation utilizing genetic algorithms [SG00]. Such algorithms refine the weight 
vector based on the maximization of a fitness function. In spite of their computational burden, methods 
that do not utilize the gradient information are robust against sudden changes in the inputs. Aside 
from these, unsupervised learning methods constitute another alternative used in the literature. Among 
a number of alternatives, reinforcement learning is one remarkable approach employing a reward and 
penalty scheme to achieve a particular goal. The process of reward and penalty is an evaluative 
feedback that characterizes how the weights of a neural structure should be modified 
[SW01,MD05,N97]. 


3.3 Architectural Varieties 


Another source of diversity in neural network applications is the architecture. We will present the alter- 
natives under two categories: first is the type of connectivity, second is the scheme adopted for neuronal 
activation. In both cases, numerous alternatives are available as summarized below. 


3.3.1 Structure 


Several types of neural network models are used frequently in the literature. Different structural 
configurations are distinguished in terms of the data flow properties of a network, that is, feedforward 
models and recurrent models. In Figure 3.1, a typical feedforward neural network is shown. Three simple 
versions of recurrent connectivity are illustrated in Figure 3.4. The network in Figure 3.4a has recurrent 
neurons in the hidden layer, the one in Figure 3.4b feeds back the network output and establishes 
recurrence externally, and the one in Figure 3.4c contains fully recurrent hidden layer neurons. 

Clearly, one can set up models having multiple outputs as well as feedback connections in between 
different layers, but due to the space limit, we omit those cases and simply consider the function of a 
neuron having feedback from other sources of information. The neuron output is as in Equation 3.2, but 
now the net sum is as follows: 


$$S_i^{k+1} = \sum_{j=1}^{n_k} w_{ij}^{k} o_j^{k} + \sum_{r=1}^{R} \rho_r \xi_r \qquad (3.14)$$ 


where 
the second sum is over all $R$ feedback connections 
$\rho_r$ denotes the weight determining the contribution of the output $\xi_r$ from a neuron in a layer 


Clearly, the proper application of the error backpropagation or Levenberg-Marquardt algorithm is 
dependent upon the proper handling of the network description under feedback paths [H94]. 


3.3.2 Neuronal Activation Scheme 


The efforts toward obtaining the best performing artificial neural model have also focused on the 
neuronal activation scheme. The map built by a neural model is strictly dependent upon the activation 
function. Smooth activation schemes produce smooth hypersurfaces, while sharp ones like the sign function 
produce hypersurfaces having very steep regions. This subsection focuses on the dependence of 
performance on the type of neuronal activation function. In the past, this was considered to some extent 
in [KA02,HN94,WZD97,CF92,E08]. Efe considers eight different data sets and eight different activation 
functions with networks having 14 different neuron numbers in the hidden layer. Under these conditions, 
the research conducted gives a clear idea about when an activation function is advisable. The goal of 
such research is to figure out what type of activation function performs well if the size of a network 
is small [E08]. Some of the neuronal activation functions mentioned frequently in the literature are 
tabulated in Table 3.2, which shows the diversity of options in setting up a neural network. Choosing 
the best activation scheme is still an active research topic, and the choice depends on the problem at 
hand as well. 

FIGURE 3.4 Three alternatives for feedback-type neural networks having single output. 

TABLE 3.2 Neuronal Activation Functions 

Name                  Activation Function                                                      Adjustables 
Hyperbolic tangent    $f(x) = \tanh(x)$                                                        None 
Damped polynomial     $f(x) = (ax^2 + bx + c)\exp(-\lambda x^2)$                               $a, b, c, \lambda$ 
Multilevel            $f(x) = \frac{1}{2M+1}\sum_{k=-M}^{M}\tanh(x - \lambda k)$               $M, \lambda$ 
Trigonometric         $f(x) = a\sin(px) + b\cos(qx)$                                           $a, b, p, q$ 
Arctangent            $f(x) = \operatorname{atan}(x)$                                          None 
Sinc                  $f(x) = \sin(\pi x)/(\pi x)$ for $x \neq 0$; $f(x) = 1$ for $x = 0$      None 
Logarithmic           $f(x) = \ln(1 + x)$ for $x \geq 0$; $f(x) = -\ln(1 - x)$ for $x < 0$     None 
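
The activation functions named in Table 3.2 can be written down in a few lines, as sketched below. The parameter defaults are arbitrary illustrative values, the dictionary `ACTIVATIONS` is a hypothetical convenience, and two readings are assumptions: the sinc entry is taken in the normalized convention $\sin(\pi x)/(\pi x)$ used by `np.sinc`, and the multilevel entry is taken as an average of $2M+1$ shifted hyperbolic tangents.

```python
import numpy as np

def damped_polynomial(x, a=1.0, b=1.0, c=0.0, lam=1.0):
    return (a * x**2 + b * x + c) * np.exp(-lam * x**2)

def multilevel(x, M=2, lam=2.0):
    # Average of 2M+1 shifted tanh units, giving a staircase-like curve (assumed form).
    x = np.asarray(x, dtype=float)
    shifts = lam * np.arange(-M, M + 1)
    return np.tanh(x[..., None] - shifts).mean(axis=-1)

def trigonometric(x, a=1.0, b=1.0, p=1.0, q=1.0):
    return a * np.sin(p * x) + b * np.cos(q * x)

def sinc(x):
    # np.sinc uses the normalized convention sin(pi x)/(pi x), with value 1 at x = 0.
    return np.sinc(x)

def logarithmic(x):
    # Odd-symmetric logarithm: ln(1 + x) for x >= 0 and -ln(1 - x) for x < 0.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

ACTIVATIONS = {
    "hyperbolic tangent": np.tanh,
    "damped polynomial": damped_polynomial,
    "multilevel": multilevel,
    "trigonometric": trigonometric,
    "arctangent": np.arctan,
    "sinc": sinc,
    "logarithmic": logarithmic,
}

x = np.linspace(-3.0, 3.0, 7)
values = {name: f(x) for name, f in ACTIVATIONS.items()}
```

Evaluating every entry on the same grid, as in `values`, is a simple way to compare how smooth or how steep the resulting curves are before committing to one of them.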


3.4 Neural Networks for Identification and Control 


3.4.1 Generating the Training Data 


A critically important component in developing a map from one domain to another is the entity that lies 
between these domains and describes the map implicitly. This entity is the numerical data, or the raw 
data to be used to build the desired mapping, and the quality of the map depends on how descriptive these 
data are. To be more explicit, for a two-input single-output mapping, it may be difficult to see the 
global picture based on the few samples shown in Figure 3.5a, yet it is slightly more visible from 
Figure 3.5b, and it is more or less a plane according to the picture in Figure 3.5c. Indeed, the sample 
points in all three plots were generated using $y = 3u_1 + 4u_2$. This simple experiment shows us that in 
order to describe the general picture, the data must be distributed well around the domain in question 
and must be dense enough to deduce the general behavior. 

In the applications reporting solutions to synthetic problems, the designer is free to generate as much 
data as he or she needs; however, in real-time problems, collecting data to train a neural network may 
be a tedious task, or even sometimes a costly one. Furthermore, for problems that have more than two 
inputs, utilizing the graphical approaches may have very limited usefulness. This discussion amounts 
to saying that if a reasonably large number of training data describing a given process are available, 
a neural network-based model is a good alternative to realize the desired map. 


FIGURE 3.5 Changing implication of some amount of data. 


3.4.2 Determining the Necessary Inputs: Information Sufficiency 


The previous discussion can be extended to the concept of information sufficiency. For the plane given 
above, a neural model would need $u_1$ and $u_2$ as its inputs. Two questions can be raised: 


• Would the neural model be able to give acceptable results if it was presented with only one of the 
inputs, for example, $u_1$ or $u_2$? 

• Would the neural model perform better if it was presented with another input containing some linear 
combination of $u_1$ and $u_2$, say, for example, $u_3 = \alpha u_1 + \beta u_2$ ($\alpha, \beta \neq 0$)? 


The answer to both questions is obviously no; however, in real-life problems, there may be a few 
hundred variables having different amounts of effect on an observed output, and some, though involved 
in the process, may have a negligibly small effect on the variables under investigation. Such cases must 
be clearly analyzed, and this has been a branch of neural network-based control research studied in the 
past [LP07]. 
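
The two questions can be checked numerically on the plane $y = 3u_1 + 4u_2$. The sketch below uses an ordinary least-squares fit as a stand-in for a neural model, which is an assumption made to keep the example short and deterministic; the sample size, the random seed, and the coefficients of the redundant input `u3` are likewise arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(-1.0, 1.0, size=(200, 2))       # samples of (u1, u2)
y = 3.0 * U[:, 0] + 4.0 * U[:, 1]               # the plane y = 3*u1 + 4*u2

def fit_rms_error(inputs, target):
    """Least-squares fit of a linear map and the RMS error it leaves."""
    coeffs, *_ = np.linalg.lstsq(inputs, target, rcond=None)
    return np.sqrt(np.mean((inputs @ coeffs - target) ** 2))

err_only_u1 = fit_rms_error(U[:, :1], y)        # u2 withheld: information is missing
err_both = fit_rms_error(U, y)                  # both inputs: the map is recovered

u3 = 0.5 * U[:, 0] - 2.0 * U[:, 1]              # hypothetical redundant input
U_aug = np.column_stack([U, u3])
err_aug = fit_rms_error(U_aug, y)               # no better than err_both
rank_aug = np.linalg.matrix_rank(U_aug)         # still 2: u3 adds no new direction
```

Withholding an input leaves a large error that no amount of training can remove, while the redundant input neither helps nor raises the rank of the input data, which mirrors the answer given to both questions.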


3.4.3 Generalization or Memorization 


Noise is an inevitable component of real-time data. Consider a map described implicitly by the data 
shown in Figure 3.6. The circles in the figure indicate the data points given to extract the shown target 
curve. The data given in the first case, the left subplot, do not contain noise, and a neural network 
that memorizes, or overfits, the given data points approximates the function well. 
Memorization in this context can formally be defined as $J = 0$, and this can be a goal if it is known 
that there is no noise in the data. If the designer knows that the data are noisy, as in the case shown 
in the middle subplot, a neural network that overfits the data will produce a curve that is dissimilar 
to the target one. Pursuing memorization for the data shown in the right subplot will produce a neural 
model that performs even worse. This discussion shows that memorization, or overfitting, for real-life 
and possibly noisy data is likely to produce poorly performing neural models; nevertheless, noise is the 
major actor determining the minimum possible value of $J$ for a given set of initial conditions and 
network architecture. 

In the noisy cases of Figure 3.6, a neural network will start from an arbitrary curve. As training 
progresses, the curve realized by the neural model will approach the target curve. After a particular 
instant, the neural model will produce a curve that passes through the data points but is not similar to 
the target curve. This particular instant is determined by utilizing two sets during the training phase. 
One is used for synaptic modification, while the other is used solely for checking the cost function. If 
the cost functions computed over both sets decrease, the network is said to generalize the given map. If 
the cost for the training set continues to decrease while that for the test set starts increasing, the 
neural network is said to start overfitting the given data, and this instant is the best instant to stop 
the training. 
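
The stopping rule described above can be sketched as a small helper that scans the cost computed over the second data set. The function name `early_stopping_epoch`, the `patience` parameter, and the cost histories are hypothetical; the patience window is a common practical addition, assumed here, that avoids stopping on a single noisy increase.

```python
def early_stopping_epoch(test_costs, patience=3):
    """Return the epoch with the lowest test-set cost before overfitting sets in.

    Scanning ends once the test cost has failed to improve for `patience`
    consecutive epochs; the best epoch seen up to that point is returned.
    """
    best_epoch, best_cost, waited = 0, float("inf"), 0
    for epoch, j_test in enumerate(test_costs):
        if j_test < best_cost:
            best_epoch, best_cost, waited = epoch, j_test, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical cost histories: the training cost keeps falling, the test cost turns up.
train_hist = [1.00, 0.60, 0.40, 0.30, 0.22, 0.17, 0.13, 0.10]
test_hist = [1.10, 0.70, 0.50, 0.45, 0.47, 0.52, 0.60, 0.71]
stop_at = early_stopping_epoch(test_hist)
```

In a real training loop the weights saved at the returned epoch would be restored, since later epochs only reduce the training cost at the expense of the test cost.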


FIGURE 3.6 Effect of noise variance on the general perception. 


3.4.4 Online or Offline Synthesis 


A neural network-based controller can be adaptive or nonadaptive depending on the needs of the problem 
at hand and the choice of the user. A nonadaptive neural network controller is trained prior to 
installation and its parameters are frozen. This approach is called offline synthesis (or training). The 
other alternative considers a neurocontroller that may or may not be pretrained but whose weights are 
refined during the operation. This scheme requires the extraction of an error measure to set up the 
update laws. The latter approach is more popular as it handles the time variation in the dynamical 
properties of a system and maintains the performance by altering the controller parameters 
appropriately. This scheme is called online learning. A major difference between these methods is the 
number of training data available at any instant of time. In offline synthesis, the entire set of data 
is available, and there is no time constraint to accomplish the training, whereas in online training, 
data come at every sampling instant and their processing for training must be completed within one 
sampling period, which typically imposes hardware constraints on the design. 


3.5 Neurocontrol Architectures 


Consider the synthetic process governed by the difference equation $x_{t+1} = f(x_t, u_t)$, where $x_t$ 
is the value of the state at discrete time $t$ and $u_t$ is the external input. Assume the functional 
details embodying the process dynamics are not available to follow conventional design techniques, and 
assume it is possible to run the system with arbitrary inputs and arbitrary initial state values. The 
problem is to design a system that observes the state of the process and outputs $u_t$, by which the 
system state follows a given reference $r_t$. A neural network-based solution to this problem prescribes 
the following steps. 


• Initialize the state $x_t$ to a randomly selected value satisfying $x_t \in \mathcal{X} \subset \mathbb{R}$, 
where $\mathcal{X}$ is the interval we are interested in. 

• Choose a stimulus satisfying $u_t \in \mathcal{U} \subset \mathbb{R}$, where $\mathcal{U}$ stands for 
the interval containing likely control inputs. 

• Measure/evaluate the value of $x_{t+1}$. 

• Form an input vector $I_t = (x_{t+1}\ x_t)$ and an output $O_t = u_t$, and repeat these four steps $P$ 
times to obtain a training data set. 

• Perform training to minimize the cost in Equation 3.1 and stop training when a stopping criterion is 
met. 


A neural network realizing the map $u_t = NN(x_{t+1}, x_t)$ would answer the question “What would be the 
value of the control signal if a transition from $x_t$ to $x_{t+1}$ is desired?” Clearly, having obtained 
a properly trained neural controller, one would set $x_{t+1} = r_{t+1}$ and the neurocontroller would 
drive the system state to the reference signal as it had been trained to do so. 

If $x_t \in \mathbb{R}^n$, then we have $r_t \in \mathbb{R}^n$ and naturally $I_t \in \mathbb{R}^{2n}$. 
Similarly, multiple inputs in a dynamic system would require a neurocontroller to have the same number 
of outputs as the plant has inputs, and the same approach would be valid. A practical difficulty in 
training the controller utilizing the direct synthesis approach is that the designer decides on the 
excitation first, that is, the set $\mathcal{U}$; however, it can be practically difficult to foresee the 
interval to which a control signal in real operation belongs. This scheme can 
be called direct inversion. In the rest of this section, common structures of neurocontrol are summarized. 
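
The data collection and training steps of direct inversion can be sketched as follows. Everything specific here is an assumption for illustration: the plant `plant` is a hypothetical linear process treated as a black box, the intervals for the state and the stimulus are arbitrary, and a single linear neuron fitted by least squares stands in for the neural network so that the result is deterministic. A nonlinear plant would call for a nonlinear network trained as in Section 3.2.

```python
import numpy as np

def plant(x, u):
    """Hypothetical process, treated as a black box by the designer."""
    return 0.8 * x + 0.5 * u

rng = np.random.default_rng(2)
P = 500
x_t = rng.uniform(-1.0, 1.0, size=P)      # step 1: random initial states from X
u_t = rng.uniform(-2.0, 2.0, size=P)      # step 2: random stimuli from U
x_next = plant(x_t, u_t)                  # step 3: measure x_{t+1}

# Step 4: inputs I_t = (x_{t+1}, x_t) with a bias column, outputs O_t = u_t.
I = np.column_stack([x_next, x_t, np.ones(P)])
O = u_t

# Step 5: "training". A single linear neuron fitted by least squares stands in
# for the neural network here; a nonlinear plant would need a nonlinear model.
weights, *_ = np.linalg.lstsq(I, O, rcond=None)

def inverse_controller(x_desired, x_current):
    """u_t = NN(x_{t+1}, x_t) with the desired next state in place of x_{t+1}."""
    return weights @ np.array([x_desired, x_current, 1.0])

# Closed-loop use: set x_{t+1} = r_{t+1} and let the controller pick u_t.
reference = 0.5 * np.sin(0.1 * np.arange(1, 101))
x, errors = 0.0, []
for r_next in reference:
    u = inverse_controller(r_next, x)
    x = plant(x, u)
    errors.append(abs(r_next - x))
max_tracking_error = max(errors)
```

The reference used in the closed-loop run demands control signals that stay inside the interval used for excitation. When that is not the case, the trained map is queried outside its training region, which is the practical difficulty noted above.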

In Figure 3.7, identification of a dynamic system is shown. The trainer adjusts the parameters of the 
neural network identifier (shown as NN-I) in such a way that the cost given in (3.1) is minimized over 
the given set of training patterns. The identification scheme depicted in Figure 3.7 can also be utilized 
to identify an existing controller acting in a feedback control system. This approach corresponds to the 
mimicking of a conventional controller [N97]. 

In Figure 3.8, the indirect learning architecture is shown. An input signal $u$ is applied to the plant, 
and the plant responds with the signal denoted by $y$. A neural network controller (NN-C in the figure), 
shown at the bottom, receives the response of the plant and tries to reconstruct the input $u$ based on 
this observation. At the same time, the modified neural network is copied in front of the plant, where it 
inverts the plant dynamics as time passes [PSY88,N97]. 

FIGURE 3.8 Indirect learning architecture. 

FIGURE 3.9 Closed-loop direct inverse control. 

In Figure 3.9, closed-loop direct inverse control is depicted. The neural network acts as the feed- 
back controller, and it receives the tracking error as the input. A trainer modifies the neural network 
parameters to force the plant response to what the command signal r prescribes. The tracking error is 
used directly as a measure of the error caused by the controller, and the scheme is called direct inver- 
sion in closed loop. 

The open-loop version of the inversion method shown in Figure 3.9 is illustrated in Figure 3.10, which 
is also called the specialized learning architecture in the literature [PSY88]. According to the shown 
connectivity, the neural inverter receives the command signal ($r$) and outputs a control signal ($u$), 
in response to which the plant produces a response denoted by $y$, and the error defined by $r - y$ is 
used to tune the parameters of the neural controller. 

As highlighted by Narendra and Parthasarathy, indirect adaptive control using neural networks is 
done as shown in Figure 3.11. An identifier develops and refines the forward model of the plant, and 
a controller receives the equivalent value of the output error after it is passed back through the 
neural model by the backpropagation technique. Such a method extracts a better error measure to penalize 
the controller parameters. Since the controller tuning needs an extra propagation of information via 
the neural identifier, the scheme is called indirect adaptive control [NP91], or forward and inverse 
modeling [N97]. 

FIGURE 3.11 Indirect adaptive control scheme. 

In many applications of neurocontrol, the plant, when discretized, has the form 
$y_{t+1} = f(y_t, y_{t-1}, \ldots, y_{t-L}) + g(y_t, y_{t-1}, \ldots, y_{t-L})\, u_t$. Such models can enjoy 
the feedback linearization technique, and when some set of consecutively collected numerical 
observations is available, one can proceed to develop the functions $f(\cdot)$ and $g(\cdot)$ separately, 
as shown in Figure 3.12. A classical control scheme supplied with the estimates of $f(\cdot)$ and 
$g(\cdot)$ can drive the system output to a given command signal adaptively. 
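
Once estimates of the two functions are available, the control signal follows by cancelling the estimated nonlinearity, as sketched below. The functions `f_hat` and `g_hat` are hypothetical closed-form stand-ins for trained neural estimates, the plant is assumed to coincide with them exactly, and only one delayed output is used; with imperfect estimates the tracking error would reflect the estimation error.

```python
import numpy as np

def f_hat(y_t, y_prev):
    """Stand-in for a trained neural estimate of f(.)."""
    return 0.6 * np.sin(y_t) + 0.2 * y_prev

def g_hat(y_t, y_prev):
    """Stand-in for a trained neural estimate of g(.), kept away from zero."""
    return 1.0 + 0.5 * np.cos(y_prev) ** 2

def plant(y_t, y_prev, u_t):
    """Hypothetical plant; here it coincides with the estimates for clarity."""
    return f_hat(y_t, y_prev) + g_hat(y_t, y_prev) * u_t

def linearizing_control(r_next, y_t, y_prev):
    """Cancel the estimated nonlinearity so that y_{t+1} is driven to r_{t+1}."""
    return (r_next - f_hat(y_t, y_prev)) / g_hat(y_t, y_prev)

reference = np.sin(0.05 * np.arange(1, 201))
y_prev, y_t, errors = 0.0, 0.0, []
for r_next in reference:
    u_t = linearizing_control(r_next, y_t, y_prev)
    y_prev, y_t = y_t, plant(y_t, y_prev, u_t)   # right-hand side uses the old values
    errors.append(abs(r_next - y_t))
max_error = max(errors)
```

The division by `g_hat` is the reason the estimate of $g(\cdot)$ must stay away from zero over the operating region; in practice this is usually enforced by the structure of the estimator or by a lower bound.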

The feedback error learning architecture proposed in [KFS87] is given in Figure 3.13, where a 
conventional controller is placed to stabilize the plant. It is assumed that the stabilizing effect 
provided by the conventional controller may not be perfect, and the neural controller provides a 
corrective control by acting in the feedforward path. 

FIGURE 3.12 Feedback linearization via neural networks. 

FIGURE 3.13 Feedback error learning architecture. 

FIGURE 3.14 A typical neural network-based control architecture. 

The typical diagrammatic view of neurocontrol is as illustrated in Figure 3.14, where there may be 
two controllers operated simultaneously, and the tapped delay line (TDL) block provides some history of 
relevant variables. A conventional controller, if used, maintains the stability of the closed loop, and 
a neurocontroller complements it. Depending on the design and expectations, one of these controllers 
plays the primary role while the other carries the auxiliary role. In every application example, a 
trainer should be provided with an error measure quantifying the distance between the current output and 
its target value. 


3.6 Application Examples 
3.6.1 Propulsion Actuation Model for a Brushless DC Motor-Propeller Pair 


The dynamical model of an unmanned aerial vehicle (UAV) like the one shown in Figure 3.15 could 
be obtained using the laws of physics. Principally, a control signal to be applied to the motors must be 
converted to pulse width modulation (pwm) signals, then electronic speed controllers properly drive the 
brushless motors, and a thrust value is obtained from each motor-propeller pair. The numerical value 
of the thrust depends on the type of the propeller and the angular speed of the rotor, and it is given 
by $f_i = b\Omega_i^2$, where $f_i$ is the thrust at the $i$th motor, $b$ is a constant-valued thrust 
coefficient, and $\Omega_i$ is the angular speed in rad/s. If the control inputs (thrusts) needed to 
observe a desired motion were immediately available, then it would be easier to proceed to the 
closed-loop control system design without worrying about the effects of the actuation periphery, which 
introduces some constraints shaping the transient and steady-state behavior of the propulsion. Indeed, 
the real-time picture is much more complicated, as the vehicle is an electrically powered one and the 
battery voltage decreases during operation. Such a change in the battery voltage causes different lift 
forces at different battery voltage levels although the applied pwm level is constant, as seen in 
Figure 3.16. The same pwm profile is applied repetitively, and as the battery voltage reduces, the 
angular speed at a constant pwm level decreases, thereby causing a decrease in the generated thrust. 
Furthermore, the relation with different pwm levels is not linear, that is, the same amount of change 
in the input causes different amounts of change at different levels, and this shows that the process to 
be modeled is a nonlinear one. 

FIGURE 3.15 (a) Schematic view and variable definitions of a quadrotor-type UAV, (b) CAD drawing, and 
(c) real implementation. 


FIGURE 3.16 (a) Applied pwm profile versus time (s) and (b) decrease in the angular speed Ω (rad/s) versus
time (s) as the battery voltage decreases.

According to Figure 3.16, comparing the fully charged condition of the battery with the condition
at the last experiment reveals a difference of 15 g at the lowest level and 154 g at the highest level. This is
clearly an uncertainty that has to be incorporated appropriately into the dynamic model and the feedback
controller. The use of neural networks is a practical alternative for resolving the problem induced by
battery conditions. Denoting the battery voltage by $V_b(t)$, a neural network model performing the map
$y_{pwm} = NN(\Omega_c, V_b)$ is the module installed at the output of a controller generating the necessary
angular speeds. Here $\Omega_c$ is the angular speed prescribed by the controller. Another neural network that
implements $y_\Omega = NN(V_b, pwm, \sigma_1(pwm))$ is the module installed at the inputs of the dynamic model
of the UAV. The function $\sigma_1(\cdot)$ is a low-pass filter incorporating the effect of the transient in the thrust
value. The dynamic model contains the thrusts $f_i$, which are computed from the angular speeds $\Omega_i$.

The reason why we would like to step down from thrusts to the pwm level and step up from pwm level 
to forces is the fact that brushless DC motors are driven at the pwm level and one has to separate the 
dynamic model of the UAV and the controller by drawing a line exactly at the point of signal exchange 
occurring at the pwm level. Use of neural networks facilitates this in the presence of voltage loss in the 
batteries. 

In Figure 3.17, the diagram describing the role of the aforementioned offline-trained neural models
is shown. In Figure 3.18, the results obtained with real-time data are shown. A chirp-like pwm
profile was generated and some noise was added to obtain the pwm signal to be applied. When this signal
is applied as an input to any motor, the variation in the battery voltage is measured and filtered to guide


[Diagram: on the controller side, a pwm predictor NN; on the dynamic model side, an Ω predicting NN; the
filtered battery voltage is supplied to the networks.]

FIGURE 3.17 Installing the neural network components for handshaking at pwm level.

FIGURE 3.18 Performance of the two neural network models.


the neural models, as shown in the top right subplot. After that, the corresponding angular speed is
obtained experimentally. In the middle left subplot, the reconstructed pwm signal and the applied
signal are shown together, whereas the middle right subplot depicts the performance of the angular
speed predicting neural model. Both subplots suggest a useful reconstruction of the signals requested from
the neural networks, which were trained using the Levenberg-Marquardt algorithm. In both models, the
neural networks have a single hidden layer with hyperbolic tangent-type neuronal nonlinearity and
linear output neurons. The pwm predicting model has 12 hidden neurons with $J = 90.8285 \times 10^{-4}$ as
the final cost, and the angular speed predicting neural model has 10 hidden neurons with
$J = 3.9208 \times 10^{-4}$ as the final cost value.
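
The architecture just described (one hidden layer of hyperbolic tangent neurons followed by linear output neurons) can be sketched in a few lines. The following is only an illustrative forward pass, not the authors' implementation: the weights are random placeholders rather than the trained values, the helper names `init_network` and `forward` are hypothetical, the input values are made up, and Levenberg-Marquardt training is not shown.

```python
import math
import random

def init_network(n_inputs, n_hidden, n_outputs, seed=0):
    """Return randomly initialized weights (placeholders, not trained values)."""
    rng = random.Random(seed)
    # Each row holds the weights of one neuron; the last entry is its bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                for _ in range(n_outputs)]
    return w_hidden, w_output

def forward(network, inputs):
    """Hidden layer: tanh neurons. Output layer: linear neurons."""
    w_hidden, w_output = network
    x = list(inputs) + [1.0]                      # append bias input
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in w_hidden]
    h = hidden + [1.0]                            # append bias input
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_output]

# pwm predicting model: inputs (Omega_c, V_b), 12 hidden neurons, 1 output
pwm_net = init_network(n_inputs=2, n_hidden=12, n_outputs=1)
# angular speed predicting model: inputs (V_b, pwm, filtered pwm), 10 hidden neurons
omega_net = init_network(n_inputs=3, n_hidden=10, n_outputs=1)

# Made-up, normalized input values purely for demonstration
y_pwm = forward(pwm_net, [0.6, 0.9])
y_omega = forward(omega_net, [0.9, 0.5, 0.45])
```

In practice the inputs would be scaled measurements and the weights would come from the offline training stage.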

The bottom subplots of Figure 3.18 illustrate the difference between the desired and predicted values.
As the local frequency of the target output increases, the neural models perform worse, yet the
performance is good when the signals change slowly. This is an expected result that agrees well
with the typical real-time signals obtained from the UAV in Figure 3.15.


3.6.2 Neural Network-Aided Control of a Quadrotor-Type UAV


The dynamic model of the quadrotor system shown in Figure 3.15 can be derived by following the
Lagrange formalism. The model governing the dynamics of the system is given in (3.15) through (3.20),
where the first three ODEs describe the motion in Cartesian space and the last three define the dynamics
determining the attitude of the vehicle. The parameters of the dynamic model are given in Table 3.3.


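$$\ddot{x} = (\cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi)\frac{1}{M}U_1 \tag{3.15}$$

$$\ddot{y} = (\cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi)\frac{1}{M}U_1 \tag{3.16}$$

$$\ddot{z} = -g + \cos\phi \cos\theta \frac{1}{M}U_1 \tag{3.17}$$

$$\ddot{\phi} = \dot{\theta}\dot{\psi}\left(\frac{I_{yy} - I_{zz}}{I_{xx}}\right) + \frac{j_r}{I_{xx}}\dot{\theta}\omega + \frac{L}{I_{xx}}U_2 \tag{3.18}$$

$$\ddot{\theta} = \dot{\phi}\dot{\psi}\left(\frac{I_{zz} - I_{xx}}{I_{yy}}\right) - \frac{j_r}{I_{yy}}\dot{\phi}\omega + \frac{L}{I_{yy}}U_3 \tag{3.19}$$

$$\ddot{\psi} = \dot{\theta}\dot{\phi}\left(\frac{I_{xx} - I_{yy}}{I_{zz}}\right) + \frac{1}{I_{zz}}U_4 \tag{3.20}$$

where

$$\omega = \Omega_1 - \Omega_2 + \Omega_3 - \Omega_4 \tag{3.21}$$

$$U_1 = b\Omega_1^2 + b\Omega_2^2 + b\Omega_3^2 + b\Omega_4^2 = f_1 + f_2 + f_3 + f_4 \tag{3.22}$$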


TABLE 3.3 Physical Parameters of the Quadrotor UAV

Symbol   Description                                          Value
L        Half distance between two motors on the same axis    0.3 m
M        Mass of the vehicle                                  0.8 kg
g        Gravitational acceleration constant                  9.81 m/s²
I_xx     Moment of inertia around x-axis                      15.67e-3
I_yy     Moment of inertia around y-axis                      15.67e-3
I_zz     Moment of inertia around z-axis                      28.346e-3
b        Thrust coefficient                                   192.3208e-7 N s²
d        Drag coefficient                                     4.003e-7 N m s²
j_r      Propeller inertia coefficient                        6.01e-5

FIGURE 3.19 Block diagram of the overall control system.


$$U_2 = b\Omega_4^2 - b\Omega_2^2 = f_4 - f_2 \tag{3.23}$$

$$U_3 = b\Omega_3^2 - b\Omega_1^2 = f_3 - f_1 \tag{3.24}$$

$$U_4 = d(\Omega_1^2 - \Omega_2^2 + \Omega_3^2 - \Omega_4^2) \tag{3.25}$$
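
As a rough numerical illustration of the model, the sketch below evaluates the control inputs and the six accelerations from the rotor speeds, using the parameter values of Table 3.3. It assumes the common form of this quadrotor model, with the signs of the gyroscopic terms following from $\omega = \Omega_1 - \Omega_2 + \Omega_3 - \Omega_4$; the function names and the constant names (`I_XX`, `J_R`, and so on) are hypothetical and chosen only for this example.

```python
import math

# Parameter values from Table 3.3 (constant names are assumptions)
L_ARM = 0.3                                      # half distance between motors (m)
M = 0.8                                          # mass of the vehicle (kg)
G = 9.81                                         # gravitational acceleration (m/s^2)
I_XX, I_YY, I_ZZ = 15.67e-3, 15.67e-3, 28.346e-3 # moments of inertia
B = 192.3208e-7                                  # thrust coefficient
D = 4.003e-7                                     # drag coefficient
J_R = 6.01e-5                                    # propeller inertia coefficient

def control_inputs(speeds):
    """Map four rotor speeds (rad/s) to (U1, U2, U3, U4, omega)."""
    o1, o2, o3, o4 = speeds
    f1, f2, f3, f4 = (B * o ** 2 for o in speeds)   # thrusts f_i = b * Omega_i^2
    u1 = f1 + f2 + f3 + f4
    u2 = f4 - f2
    u3 = f3 - f1
    u4 = D * (o1 ** 2 - o2 ** 2 + o3 ** 2 - o4 ** 2)
    omega = o1 - o2 + o3 - o4
    return u1, u2, u3, u4, omega

def accelerations(angles, rates, speeds):
    """Return (x'', y'', z'', phi'', theta'', psi'') for the assumed model."""
    phi, theta, psi = angles
    dphi, dtheta, dpsi = rates
    u1, u2, u3, u4, omega = control_inputs(speeds)
    ddx = (math.cos(phi) * math.sin(theta) * math.cos(psi)
           + math.sin(phi) * math.sin(psi)) * u1 / M
    ddy = (math.cos(phi) * math.sin(theta) * math.sin(psi)
           - math.sin(phi) * math.cos(psi)) * u1 / M
    ddz = -G + math.cos(phi) * math.cos(theta) * u1 / M
    ddphi = (dtheta * dpsi * (I_YY - I_ZZ) / I_XX
             + J_R / I_XX * dtheta * omega + L_ARM / I_XX * u2)
    ddtheta = (dphi * dpsi * (I_ZZ - I_XX) / I_YY
               - J_R / I_YY * dphi * omega + L_ARM / I_YY * u3)
    ddpsi = dtheta * dphi * (I_XX - I_YY) / I_ZZ + u4 / I_ZZ
    return ddx, ddy, ddz, ddphi, ddtheta, ddpsi

# Rotor speed at which the total thrust balances the weight: 4 b Omega^2 = M g
hover_speed = math.sqrt(M * G / (4.0 * B))
hover = accelerations((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), [hover_speed] * 4)
```

With four equal rotor speeds at the hover value, all six accelerations vanish up to rounding, which is a quick sanity check on the signs used in the input definitions.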


The control problem here is to drive the UAV toward a predefined trajectory in the 3D space by generat- 
ing an appropriate sequence of Euler angles, which need to be controlled as well. The block diagram of 
the scheme is shown in Figure 3.19, where the attitude control is performed via feedback linearization 
using a neural network, and outer loops are established based on classical control theory. 

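The role of the linearizing neural network is to observe the rates of the roll ($\dot{\phi}$), pitch ($\dot{\theta}$), and
yaw ($\dot{\psi}$) angles and the parameter $\omega$, and to provide a prediction for the function given as follows:

$$NN \approx F(\dot{\phi}, \dot{\theta}, \dot{\psi}, \omega) = \begin{pmatrix} \dot{\theta}\dot{\psi}\left(\dfrac{I_{yy} - I_{zz}}{I_{xx}}\right) + \dfrac{j_r}{I_{xx}}\dot{\theta}\omega \\ \dot{\phi}\dot{\psi}\left(\dfrac{I_{zz} - I_{xx}}{I_{yy}}\right) - \dfrac{j_r}{I_{yy}}\dot{\phi}\omega \\ \dot{\theta}\dot{\phi}\left(\dfrac{I_{xx} - I_{yy}}{I_{zz}}\right) \end{pmatrix} \tag{3.26}$$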


The observations are noisy. Based on the dynamic model, a total of 2000 pairs of training data and
200 validation samples were generated to train the neural network. The Levenberg-Marquardt training
scheme was used to update the network parameters, and the training was stopped after 10,000 iterations;
the final MSE cost is J = 3.3265e-7, which was found acceptable. In order to incorporate the effects of
noise, the training data were corrupted by noise amounting to 5% of the measurement magnitude to
maintain a good level of generality. The neural network realizing the vector function given in (3.26) has
two hidden layers employing hyperbolic tangent-type neuronal activation functions, and the output layer
is chosen to be linear. The first hidden layer has 12 neurons and the second hidden layer has 6. With
the help of the neural network, the following attitude controller has been realized:


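$$U_2 = \frac{I_{xx}}{L}\left(K_\phi e_\phi + 2\sqrt{K_\phi}\,\dot{e}_\phi + \ddot{\phi}_r - NN_1\right) \tag{3.27}$$

$$U_3 = \frac{I_{yy}}{L}\left(K_\theta e_\theta + 2\sqrt{K_\theta}\,\dot{e}_\theta + \ddot{\theta}_r - NN_2\right) \tag{3.28}$$

$$U_4 = I_{zz}\left(K_\psi e_\psi + 2\sqrt{K_\psi}\,\dot{e}_\psi + \ddot{\psi}_r - NN_3\right) \tag{3.29}$$

where

$$e_\phi = \phi_r - \phi, \qquad e_\theta = \theta_r - \theta, \qquad e_\psi = \psi_r - \psi$$

and a variable with subscript r denotes a reference signal for the relevant variable.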
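
To see why a law of this form linearizes the attitude dynamics, consider the roll axis alone. If the network output cancels the nonlinear term exactly, substituting the control into the roll equation leaves the error dynamics $\ddot{e}_\phi + 2\sqrt{K_\phi}\,\dot{e}_\phi + K_\phi e_\phi = 0$, a critically damped second-order system. The sketch below simulates this with explicit Euler integration. It rests on several assumptions: the exact nonlinearity stands in for the trained network, the pitch and yaw rates and $\omega$ are held at made-up constant values, the gain $K_\phi = 4$ is taken as an example, and all function names are hypothetical.

```python
import math

# Values from Table 3.3; K_PHI is an assumed example gain
I_XX, I_YY, I_ZZ = 15.67e-3, 15.67e-3, 28.346e-3
L_ARM = 0.3
J_R = 6.01e-5
K_PHI = 4.0

def roll_nonlinearity(dtheta, dpsi, omega):
    """Roll component of F; here it stands in for the network output NN_1."""
    return dtheta * dpsi * (I_YY - I_ZZ) / I_XX + (J_R / I_XX) * dtheta * omega

def roll_control(e, de, ddphi_ref, nn1):
    """Roll control input U_2 of the feedback-linearizing law."""
    return (I_XX / L_ARM) * (K_PHI * e + 2.0 * math.sqrt(K_PHI) * de
                             + ddphi_ref - nn1)

def simulate_roll(phi0=0.5, phi_ref=0.0, t_end=5.0, dt=1e-3):
    """Integrate the closed-loop roll dynamics; return the error history."""
    phi, dphi = phi0, 0.0
    dtheta, dpsi, omega = 0.2, -0.1, 30.0   # held constant (made-up values)
    errors = []
    for _ in range(int(t_end / dt)):
        e, de = phi_ref - phi, -dphi        # constant reference, so its rate is zero
        nn1 = roll_nonlinearity(dtheta, dpsi, omega)
        u2 = roll_control(e, de, 0.0, nn1)
        # Plant: nonlinear term plus the scaled control input
        ddphi = roll_nonlinearity(dtheta, dpsi, omega) + (L_ARM / I_XX) * u2
        phi, dphi = phi + dt * dphi, dphi + dt * ddphi   # explicit Euler step
        errors.append(phi_ref - phi)
    return errors

errors = simulate_roll()
```

Under these idealized conditions the error decays toward zero without changing sign, as expected for a critically damped response; with a real network, the imperfect cancellation would appear as a small disturbance on these error dynamics.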


Since the motion in Cartesian space is realized by appropriately driving the Euler angles to their desired
values, the Cartesian controller produces the necessary Euler angles for a prespecified motion in 3D space.
The controller introducing this ability is given in (3.30) through (3.33). The law in (3.30) maintains the
desired altitude, and upon substituting it into the dynamical equations in (3.15) through (3.17) with the
small-angle approximation, we get the desired Euler angle values as given in (3.31) and (3.32). Specifically,
turns around the z-axis are not desired, and we impose $\psi_r = 0$ as given in (3.33).


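$$U_1 = \frac{M(F_z + g)}{\cos\phi \cos\theta} \tag{3.30}$$

$$\phi_r = -\arctan\left(\frac{F_y}{F_z + g}\right) \tag{3.31}$$

$$\theta_r = \arctan\left(\frac{F_x}{F_z + g}\right) \tag{3.32}$$

$$\psi_r = 0 \tag{3.33}$$

where

$$F_z = -4\dot{z} - 4(z - z_r)$$

$$F_y = -\dot{y} - (y - y_r)$$

$$F_x = -\dot{x} - (x - x_r)$$

$$K_\phi = K_\theta = 4, \qquad K_\psi = 9$$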
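
The outer loop can be sketched as a single function that turns position and velocity measurements into a total thrust command and reference Euler angles. This is a minimal illustration under the assumption that the laws take the form $U_1 = M(F_z + g)/(\cos\phi\cos\theta)$, $\phi_r = -\arctan(F_y/(F_z + g))$, and $\theta_r = \arctan(F_x/(F_z + g))$ with the feedback terms $F_x$, $F_y$, $F_z$ as read from the text; the function name `cartesian_controller` and its argument layout are hypothetical.

```python
import math

M = 0.8     # mass of the vehicle (kg), Table 3.3
G = 9.81    # gravitational acceleration (m/s^2), Table 3.3

def cartesian_controller(pos, vel, ref, phi, theta):
    """Return (U1, phi_r, theta_r, psi_r) for the current state and reference."""
    x, y, z = pos
    dx, dy, dz = vel
    x_r, y_r, z_r = ref
    # Desired accelerations produced by simple position/velocity feedback
    f_z = -4.0 * dz - 4.0 * (z - z_r)
    f_y = -dy - (y - y_r)
    f_x = -dx - (x - x_r)
    # Altitude law: total thrust compensating gravity and the current tilt
    u1 = M * (f_z + G) / (math.cos(phi) * math.cos(theta))
    # Reference Euler angles obtained with the small-angle approximation
    phi_r = -math.atan(f_y / (f_z + G))
    theta_r = math.atan(f_x / (f_z + G))
    psi_r = 0.0                      # turns around the z-axis are not desired
    return u1, phi_r, theta_r, psi_r

# On the reference and at rest: thrust equals the weight, no tilt is requested
at_rest = cartesian_controller((0.0, 0.0, 1.0), (0.0, 0.0, 0.0),
                               (0.0, 0.0, 1.0), 0.0, 0.0)
# One meter ahead of the reference in x: a negative pitch reference is requested
ahead = cartesian_controller((1.0, 0.0, 1.0), (0.0, 0.0, 0.0),
                             (0.0, 0.0, 1.0), 0.0, 0.0)
```

The reference angles returned here would then be passed to the attitude controller as $\phi_r$, $\theta_r$, and $\psi_r$.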


The results for the feedback control are obtained via simulations, as shown in Figures 3.20 and 3.21.
In the upper row of Figure 3.20, the trajectory followed in the 3D space is shown first. The desired path
is followed at an admissible level of accuracy under the variation of battery voltage shown in the middle
subplot. The brushless motor-driven actuation scheme modulates the entire system, so a very noisy
battery voltage is measured. The filtered battery voltage is adequate for predicting the necessary pwm
level, as also discussed in the previous subsection. The bottom row of Figure 3.20 shows the
errors in Cartesian space. Since the primary goal of the design is to maintain a desired altitude, the
performance in the z-direction is better than in the other directions. The errors in the x- and y-directions
are also due to the imperfections introduced by the small-angle approximation. Nevertheless, compared
with the trajectories in the top left subplot, the magnitudes of the errors shown are acceptable too. In
Figure 3.21, the attitude of the vehicle and the errors in the Euler angles are shown. Despite the large
initial errors, the controller is able to drive the vehicle attitude to its desired values quickly, and the
prescribed desired attitude angles are followed with good precision.

FIGURE 3.20 The trajectory followed in the Cartesian space.

FIGURE 3.21 The attitude of the vehicle and the errors in the Euler angles.


Overall, the neural network structures in such a multivariable nonlinear feedback control example
serve two purposes. First, they provide handshaking between the actuation model and the dynamic model
of the plant, alleviating the difficulties caused by the variations in the battery voltage; second, they
provide linearization of the attitude dynamics to guide the vehicle correctly in the 3D space. In both
roles, the neural networks function well, and the application presented here is a good test bed for
real-time data collection and numerical data-based approaches such as neural network tools.

3.7 Concluding Remarks


The ever-increasing interest in neural networks in control systems has led to widely used software
tools, hardware tools coping with the computational intensity, and a number of algorithms offering
various types of training possibilities. Most applications of neural networks involve several variables
changing in time and strong interdependencies among the variables. Such cases require a careful
analysis of raw data as well as analytical tools to perform the necessary manipulations. The main
motivation for using neural networks in such applications is to solve the problem of constructing a
function whose analytical form is missing but for which a set of data is available. With various types of
architectural connectivity, activation schemes, and learning mechanisms, artificial neural networks are
very powerful tools in every branch of engineering, and their importance in the discipline of control
engineering is increasing in parallel with the increase in the complexity being dealt with. In the future,
neural network models will continue to exist in control systems that require some degree of intelligence
to predict the necessary maps.

4.1 Introduction to Intelligent Control


For the purposes of system control, much valuable knowledge and many techniques, such as feedback 
control, transfer functions (frequency or discrete-time domain), state-space time-domain, optimal 
control, adaptive control, robust control, gain scheduling, model-reference adaptive control, etc., have 
been investigated and developed during the past few decades. Different important concepts such as root 
locus, Bode plot, phase margin, gain margin, eigenvalues, eigenvectors, pole placement, etc., have been 
imported from different areas to, or developed in, the control field. 

However, most of these control techniques rely on system mathematical models in their design pro- 
cess. Control designers spend more time in obtaining an accurate system model (through techniques 
such as system identification, parameter estimation, component-wise modeling, etc.) than in the design 
of the corresponding control law. Furthermore, many control techniques, such as transfer function 
approach, in general, require the system to be linear and time invariant; otherwise, linearization tech- 
niques at different operating points are required in order to arrive at an acceptable control law/gain. 
With the use of system mathematical models, especially a linear-time invariant model, one can certainly 
enhance the theoretical support of the developed control techniques. However, this requirement 
creates another fundamental problem: How accurately does the mathematical model represent the sys- 
tem dynamics? In many cases, the mathematical model is only an approximated rather than an exact 
model of the system dynamics being investigated. This approximation may lead to a reasonable, but not 
necessarily good, control law for the system of interest. 


For example, proportional integral (PI) control, which is simple, well known, and well suited for the
control of linear time-invariant systems, has been used extensively for industrial motor control. The design
process to obtain the PI gains is closely tied to the mathematical model of the motor. Engineers usually first
design a PI controller based on a reasonably accurate mathematical model of the motor, then use the root
locus/Bode plot technique to obtain suitable gains for the controller to achieve desirable motor performance.
They then need to tune the control gains on-line when the motor controllers are first put into use in order
to obtain acceptable motor performance in the real world. The requirement of gain tuning is mostly due to
the unavoidable modeling error embedded in the mathematical models used in the design process. The motor
controller may further require gain adjustments during on-line operation in order to compensate for
changes in system parameters due to factors such as system degradation, change of operating conditions, etc.
Adaptive control has been studied to address the changes in system parameters and has achieved a certain
level of success. Gain scheduling has been studied and used in control loops [1,2] so that the motor can give
satisfactory performance over a large operating range. The requirement of mathematical models imposes
artificial mathematical constraints on the control design freedom. Along with the unavoidable modeling
error, the resulting control laws in many cases give an overconservative motor performance.

There are also other control techniques, such as set-point control, sliding mode control, fuzzy
control, and neural control, that rely less on the mathematical model of the system and more on the
designer's knowledge about the actual system. In particular, intelligent control has been attracting
significant attention in the last few years, and various experts' opinions have been reported in technical
articles. A control system that incorporates human qualities, such as heuristic knowledge and the ability to
learn, can be considered to possess a certain degree of intelligence. Such an intelligent control system has
an advantage over purely analytical methods because, besides incorporating human knowledge, it is less
dependent on the overall mathematical model. In fact, human beings routinely perform very complicated
tasks without the aid of any mathematical representations. A simple knowledge base and the ability to
learn by training seem to guide humans through even the most difficult problems. Although conventional
control techniques are considered to have intelligence at a low level, we want to further develop
the control algorithms from low-level to high-level intelligent control through the incorporation of
heuristic knowledge and learning ability via fuzzy and neural network technologies, among others.


4.1.1 Fuzzy Control 


Fuzzy control is considered an intelligent control technique and has been shown to yield promising results
for many applications that are difficult to handle with conventional techniques [3-5]. Implementations
of fuzzy control in areas such as water quality control [6], automatic train operation systems [7], and traffic
control [13], among others, have indicated that fuzzy logic is a powerful tool in the control of
mathematically ill-defined systems, which are controlled satisfactorily by human operators without
knowledge of the underlying mathematical model of the system. While conventional control methods are
based on the quantitative analysis of the mathematical model of a system, fuzzy controllers focus on a
linguistic description of the control action, which can be drawn, for example, from the behavior of a human
operator. This can be viewed as a shift from conventional precise mathematical control to human-like
decision making, which drastically changes the approach to automating control actions.

Fuzzy logic can easily implement human experiences and preferences via membership functions and 
fuzzy rules, from a qualitative description to a quantitative description that is suitable for microproces- 
sor implementation of the automation process. Fuzzy membership functions can have different shapes 
depending on the designer’s preference and/or experience. The fuzzy rules, which describe the control 
strategy in a human-like fashion, are written as antecedent-consequent pairs of IF-THEN statements
and stored in a table. Basically, there are four modes of derivation of fuzzy control rules: (1) expert 
experience and control engineering knowledge, (2) behavior of human operators, (3) derivation based 
on the fuzzy model of a process, and (4) derivation based on learning. These do not have to be mutually 
exclusive. In later sections, we will discuss more about membership functions and fuzzy rules. 

Due to the use of linguistic variables and fuzzy rules, the fuzzy controller can be made understandable
to a nonexpert operator. Moreover, the description of the control strategy could be derived by examin- 
ing the behavior of a conventional controller. The fuzzy characteristics make it particularly attractive for 
control applications because only a linguistic description of the appropriate control strategy is needed in 
order to obtain the actual numerical control values. Thus, fuzzy logic can be used as a general methodol- 
ogy to incorporate knowledge, heuristics or theory into a controller. 

In addition, fuzzy logic has the freedom to completely define the control surface without the use of 
complex mathematical analysis, as discussed in later sections. On the other hand, the amount of effort 
involved in producing an acceptable rule base and in fine-tuning the fuzzy controller is directly pro- 
portional to the number of quantization levels used, and the designer is left to choose the best tradeoff 
between (1) being able to create a large number of features on the control surface and (2) not having to 
spend much time in the fine-tuning process. The general shape of these features depends on the heu- 
ristic rule base and the configuration of the membership functions. Being able to quantize the domain 
of the control surface using linguistic variables allows the designer to depart from the mathematical 
constraints (e.g., hyperplane constraints in PI control) and achieve a control surface, which has more 
features and contours. 

In 1965, L.A. Zadeh laid the foundations of fuzzy set theory [8], which is a generalization of conven- 
tional set theory, as a method of dealing with the imprecision of the real physical world. Bellman and 
Zadeh write “Much of the decision-making in the real world takes place in an environment in which 
the goals, the constraints and the consequences of possible actions are not known precisely” [9]. This 
“imprecision” or fuzziness is the core of fuzzy logic. Fuzzy control is the technology that applies fuzzy 
logic to solve control problems. 

This section is written to provide readers with an introduction to the use of fuzzy logic to solve control
problems; it also intends to provide information for further exploration of related topics. In this
section, we will briefly overview some fundamental fuzzy logic concepts and operations and then apply the
fuzzy logic technique to a dc motor control system to demonstrate the fuzzy control design procedure.
The advantages of using fuzzy control are more substantial when it is applied to nonlinear and ill-defined
systems. The dc motor control system is used as an illustration here because most readers with some
control background should be familiar with this popular control example, so that they can benefit more
from the fuzzy control design process explained here. Readers interested in more details
about fuzzy logic and fuzzy control should refer to the bibliography section at the end of this chapter.


4.2 Brief Description of Fuzzy Logic 
4.2.1 Crisp Set 


The basic principle of conventional set theory is that an element is either a member or not a member
of a set. A set that is defined in this way is called a crisp set, since its boundary is well defined. Consider
the set, W, of motor speed operating range between 0 and 175 rad/s. The proper motor speed operating
range would be written as


$$W = \{ w \in W \mid 0\ \text{rad/s} \le w \le 175\ \text{rad/s} \}. \tag{4.1}$$


The set W could be expressed by its membership function W(w), which indicates whether the motor is 
within its operating range: 


$$W(w) = \begin{cases} 1; & 0\ \text{rad/s} \le w \le 175\ \text{rad/s} \\ 0; & \text{otherwise} \end{cases} \tag{4.2}$$


A graphical representation of W(w) is shown in Figure 4.1. 
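
A crisp membership function is simply an indicator function. The short sketch below illustrates this for the operating range, assuming the range stated in the text (0 to 175 rad/s, with both end points included); the function name `crisp_membership` is hypothetical.

```python
def crisp_membership(w, low=0.0, high=175.0):
    """Return 1 if the speed w (rad/s) lies inside [low, high], else 0."""
    return 1 if low <= w <= high else 0

# Every speed is either fully inside the set or fully outside it
grades = [crisp_membership(w) for w in (-5.0, 0.0, 100.0, 175.0, 180.0)]
```

There are no intermediate grades: a speed of 175 rad/s belongs to the set completely, while 175.1 rad/s does not belong at all.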

FIGURE 4.1 Crisp membership function for proper speed operating condition defined in Equation 4.1.


4.2.2 Fuzzy Set 


However, in the real world, sets are not always so crisp. Human beings routinely use concepts that are 
approximate and imprecise to describe most problems and events. The human natural language largely 
reflects these approximations. For example, the meaning of the word “fast” to describe the motor speed 
depends on the person who uses it and the context he or she uses it in. Hence, various speeds may belong 
to the set “fast” to various degrees, and the boundary of the set operating range is not very precise. For 
example, if we wish to consider 150 rad/s as fast, then 160 rad/s is certainly fast. However, do we still say
147 rad/s is fast too? How about 145 rad/s? Conventional set theory techniques have difficulty in dealing
with this type of linguistic or qualitative problem.

Fuzzy sets are very useful in representing linguistic variables, which are quantities described by 
natural or artificial language [10] and whose values are linguistic (qualitative) and not numeric 
(quantitative). A linguistic variable can be considered either as a variable whose value is a fuzzy 
number (after fuzzification) or as a variable whose values are defined in linguistic terms. Examples 
of linguistic variables are fast, slow, tall, short, young, old, very tall, very short, etc. More specifi- 
cally, the basic idea underlying the fuzzy set theory is that an element is a member of a set to a 
certain degree, which is called the membership grade (value) of the element in the set. Let U be a
collection of elements denoted by {u}, which could be discrete or continuous. U is called the universe
of discourse and u represents the generic element of U. A fuzzy set A in a universe of discourse U is
then characterized by a membership function A(·) that maps U onto a real number in the interval
$[A_{min}, A_{max}]$. If $A_{min} = 0$ and $A_{max} = 1$, the membership function is called a normalized
membership function and $A: U \to [0,1]$.

For example, a membership value A(u) = 0.8 suggests that u is a member of A to a degree of 0.8, on 
a scale where zero is no membership and one is complete membership. One can then see that crisp set 
theory is just a special case of fuzzy set theory. A fuzzy set A in U can be represented as a set of ordered 
pairs of an element u and its membership value in A: A = {(u, A(u))|u € U}. The element u is sometimes 
called the support, while A(u) is the corresponding membership function of u of the fuzzy set A. When 
U is continuous, the common notation used to represent the fuzzy set A is

$$A = \int_U \frac{A(u)}{u} \tag{4.3}$$

and when U is discrete, the fuzzy set A is normally represented as

$$A = \sum_{i=1}^{n} \frac{A(u_i)}{u_i} = \frac{A(u_1)}{u_1} + \frac{A(u_2)}{u_2} + \cdots + \frac{A(u_n)}{u_n} \tag{4.4}$$


where n is the number of supports in the set. It is to be noted that, in fuzzy logic notation, the summa- 
tion represents union, not addition. Also, the membership values are “tied” to their actual values by the 
divide sign, but they are not actually being divided. 


4.3 Qualitative (Linguistic) to Quantitative Description 


In this section, we use height as an example to explain the concept of fuzzy logic. In later sections, we
will apply the same concept to motor speed control. The word "tall" may refer to different heights
depending on the person who uses it and the context in which he or she uses it. Hence, various heights
may belong to the set "tall" to various degrees, and the boundary of the set tall is not very precise. Let
us consider the linguistic variable "tall" and assign values in the range 0 to 1. A person 7 ft tall may be
considered "tall" with a value of 1. Certainly, anyone over 7 ft tall would also be considered tall with a
membership of 1. A person 6 ft tall may be considered "tall" with a value of 0.5, while a person 5 ft tall
may be considered tall with a membership of 0. As an example, the membership function $\mu_{TALL}$, which
maps the values between 4 and 8 ft into the fuzzy set "tall," is described continuously by Equation 4.5
and shown in Figure 4.2:


$$TALL = \mu_{TALL}(height) = \int_{[4,8]} \frac{1 / \left(1 + e^{-5(height - 6)}\right)}{height} \tag{4.5}$$

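
A quick numerical check shows how a smooth membership function can reproduce the grades quoted above. The sketch assumes a sigmoid centered at 6 ft with slope parameter 5, which yields 0.5 at 6 ft and values very close to 0 and 1 at 5 ft and 7 ft; the function name `mu_tall` is hypothetical.

```python
import math

def mu_tall(height_ft):
    """Sigmoid membership grade of "tall" for a height given in feet."""
    return 1.0 / (1.0 + math.exp(-5.0 * (height_ft - 6.0)))

# Grades at the heights discussed in the text
grade_5ft = mu_tall(5.0)   # close to 0
grade_6ft = mu_tall(6.0)   # exactly 0.5
grade_7ft = mu_tall(7.0)   # close to 1
```

The grades at 5 ft and 7 ft are not exactly 0 and 1, which is typical of sigmoid-shaped membership functions: they approach the limits asymptotically.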
Membership functions are very often represented by the discrete fuzzy membership notation when 
membership values can be obtained for different supports based on collected data. For example, a five- 
player basketball team may have a fuzzy set “TALL” defined by 


$$TALL = \mu_{TALL}(height) = \frac{0.5}{6} + \frac{0.68}{6.2} + \frac{0.82}{6.4} + \frac{0.92}{6.6} + \frac{0.98}{6.8} + \frac{1}{7} \tag{4.6}$$


where the heights are listed in feet. This membership function is shown in Figure 4.3. 


FIGURE 4.2 Continuous membership function plot for linguistic variable "TALL" defined in Equation 4.5
(horizontal axis: height; vertical axis: membership grade).

FIGURE 4.3 Discrete membership function plot for linguistic variable "TALL" defined in Equation 4.6.


In the discrete fuzzy set notation, linear interpolation between two closest known supports and cor- 
responding membership values is often used to compute the membership value for the u that is not 
listed. For example, if a new player joins the team and his height is 6.1 ft, then his TALL membership 
value based on the membership function defined in (4.6) is 


$$\mu_{TALL}(6.1) = 0.5 + \frac{(6.1 - 6)(0.68 - 0.5)}{6.2 - 6} = 0.59. \tag{4.7}$$
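
The interpolation step can be written as a small helper that stores a discrete fuzzy set as a mapping from supports to membership values. This is only a sketch; the dictionary layout and the function name `membership` are assumptions made for illustration.

```python
# Discrete fuzzy set TALL: support (height in ft) -> membership value
TALL = {6.0: 0.5, 6.2: 0.68, 6.4: 0.82, 6.6: 0.92, 6.8: 0.98, 7.0: 1.0}

def membership(fuzzy_set, u):
    """Membership of u, linearly interpolated between the two closest supports."""
    if u in fuzzy_set:
        return fuzzy_set[u]
    supports = sorted(fuzzy_set)
    if not supports[0] <= u <= supports[-1]:
        raise ValueError("u lies outside the listed supports")
    for lo, hi in zip(supports, supports[1:]):
        if lo < u < hi:
            fraction = (u - lo) / (hi - lo)
            return fuzzy_set[lo] + fraction * (fuzzy_set[hi] - fuzzy_set[lo])

# New player with a height of 6.1 ft: approximately 0.59
new_player = membership(TALL, 6.1)
```

Values outside the listed supports are rejected here rather than extrapolated, since the text only describes interpolation between known supports.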


Note that the membership function defined in Equation 4.5 is normalized between 0 and 1 while the one 
defined in Equation 4.6 is not, because its minimum membership value is 0.5 rather than 0. 

It should be noted that the choice of a membership function relies on the actual situation and is 
very much based on heuristic and educated judgment. In addition, the imprecision of fuzzy set theory 
is different from the imprecision dealt with by probability theory. The fundamental difference is that 
probability theory deals with randomness of future events due to the possibility that a particular event 
may or may not occur, whereas fuzzy set theory deals with imprecision in current or past events due 
to the vagueness of a concept (the membership or nonmembership of an object in a set with imprecise 
boundaries) [11]. 


4.4 Fuzzy Operations 


The combination of membership functions requires some form of set operations. Three basic opera- 
tions of conventional set theory are intersection, union, and complement. The fuzzy logic counter- 
parts to these operations are similar to those of conventional set theory. Fuzzy set operations such as 
union, intersection, and complement are defined in terms of the membership functions. Let A and B
be two fuzzy sets with membership functions $\mu_A$ and $\mu_B$, respectively, defined for all $u \in U$. The TALL
membership function has been defined in Equation 4.6, and a FAST membership function is defined
as (Figure 4.4)


$$FAST = \mu_{FAST}(height) = \frac{0.5}{6} + \frac{0.8}{6.2} + \frac{0.9}{6.4} + \frac{1}{6.6} + \frac{0.75}{6.8} + \frac{0.6}{7} \tag{4.8}$$


These two membership functions will be used in the next sections as examples for fuzzy operation 
discussions. 

FIGURE 4.4 Discrete membership function plot for linguistic variable "FAST" defined in Equation 4.8.

4.4.1 Union


The membership function $\mu_{A \cup B}$ of the union $A \cup B$ is pointwise defined for all $u \in U$ by

$$\mu_{A \cup B}(u) = \max\{\mu_A(u), \mu_B(u)\} = \vee\{\mu_A(u), \mu_B(u)\}, \tag{4.9}$$

where ∨ symbolizes the max operator. As an example, a fuzzy set could be composed of basketball players
who are "tall or fast." The membership function for this fuzzy set could be expressed in the union notation as


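$$\begin{aligned} \mu_{TALL \cup FAST}(height) &= \max\{\mu_{TALL}(height), \mu_{FAST}(height)\} \\ &= \frac{(0.5 \vee 0.5)}{6} + \frac{(0.68 \vee 0.8)}{6.2} + \frac{(0.82 \vee 0.9)}{6.4} + \frac{(0.92 \vee 1)}{6.6} + \frac{(0.98 \vee 0.75)}{6.8} + \frac{(1 \vee 0.6)}{7} \\ &= \frac{0.5}{6} + \frac{0.8}{6.2} + \frac{0.9}{6.4} + \frac{1}{6.6} + \frac{0.98}{6.8} + \frac{1}{7}. \end{aligned} \tag{4.10}$$
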
The corresponding membership function is shown in Figure 4.5. 

4.4.2 Intersection

The membership function $\mu_{A \cap B}$ of the intersection $A \cap B$ is pointwise defined for all $u \in U$ by

$$\mu_{A \cap B}(u) = \min\{\mu_A(u), \mu_B(u)\} = \wedge\{\mu_A(u), \mu_B(u)\}, \tag{4.11}$$

where ∧ symbolizes the min operator. As an example, a fuzzy set could be composed of basketball
players who are "TALL and FAST." The membership function for this fuzzy set could be expressed in
the intersection notation as


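$$\begin{aligned} \mu_{TALL \cap FAST}(height) &= \min\{\mu_{TALL}(height), \mu_{FAST}(height)\} \\ &= \frac{(0.5 \wedge 0.5)}{6} + \frac{(0.68 \wedge 0.8)}{6.2} + \frac{(0.82 \wedge 0.9)}{6.4} + \frac{(0.92 \wedge 1)}{6.6} + \frac{(0.98 \wedge 0.75)}{6.8} + \frac{(1 \wedge 0.6)}{7} \\ &= \frac{0.5}{6} + \frac{0.68}{6.2} + \frac{0.82}{6.4} + \frac{0.92}{6.6} + \frac{0.75}{6.8} + \frac{0.6}{7}. \end{aligned} \tag{4.12}$$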


The corresponding membership function is shown in Figure 4.6. 
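
Both operations reduce to a pointwise max or min over the shared supports, which the following sketch makes explicit. It assumes the two fuzzy sets are stored as dictionaries with identical supports; the names `fuzzy_union` and `fuzzy_intersection` are hypothetical.

```python
# Discrete fuzzy sets over the same supports (heights in ft)
TALL = {6.0: 0.5, 6.2: 0.68, 6.4: 0.82, 6.6: 0.92, 6.8: 0.98, 7.0: 1.0}
FAST = {6.0: 0.5, 6.2: 0.8, 6.4: 0.9, 6.6: 1.0, 6.8: 0.75, 7.0: 0.6}

def fuzzy_union(a, b):
    """Pointwise max of two fuzzy sets defined on the same supports."""
    return {u: max(a[u], b[u]) for u in a}

def fuzzy_intersection(a, b):
    """Pointwise min of two fuzzy sets defined on the same supports."""
    return {u: min(a[u], b[u]) for u in a}

tall_or_fast = fuzzy_union(TALL, FAST)
tall_and_fast = fuzzy_intersection(TALL, FAST)
```

At every support the union grade is at least as large as the intersection grade, since the max of two numbers is never smaller than their min.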

FIGURE 4.5 Membership function plot of $\mu_{TALL \cup FAST}(height)$ defined in (4.10).

FIGURE 4.6 Membership function plot of $\mu_{TALL \cap FAST}(height)$ defined in Equation 4.12.


4.4.3 Complement 


Also, the membership function $\mu_{\bar{A}}$ of the complement of the fuzzy set A is pointwise defined for all $u \in U$ by

$$\mu_{\bar{A}}(u) = 1 - \mu_A(u). \tag{4.13}$$

The fuzzy set "short" could be considered to be the complement of the fuzzy set TALL, expressed as

$$\mu_{SHORT}(u) = \mu_{\overline{TALL}}(u) = 1 - \mu_{TALL}(u). \tag{4.14}$$

FIGURE 4.7 Membership functions $\mu_{SHORT}(height)$ and $\mu_{TALL}(height)$ defined in (4.5) and (4.14).

FIGURE 4.8 Membership functions $\mu_{SHORT}(height)$ and $\mu_{TALL}(height)$ defined in (4.6) and (4.14).


The SHORT fuzzy set corresponding to the TALL fuzzy set defined in Equation 4.5 is shown in Figure 4.7.

The SHORT fuzzy set corresponding to the TALL fuzzy set defined in Equation 4.6 is shown in Figure 4.8.

Other operations exist for fuzzy sets. Different modifications of the union, intersection, and comple- 
ment have also been proposed and used. For example, unions and intersections of various strengths can 
be achieved by using the Yager class (among others) to perform the fuzzy union and intersection opera- 
tions. But the three operations described above represent the most popular ones. 
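
The three basic operations can be sketched in a few lines of Python. This is only an illustration, assuming discrete fuzzy sets stored as dictionaries that map a height to a membership grade; the TALL and FAST grades are taken as read from Equation 4.12, and the helper names are hypothetical.

```python
# Discrete fuzzy sets: height in feet -> membership grade.
# Grades are assumed from the terms of Equation 4.12.
TALL = {6.0: 0.5, 6.2: 0.68, 6.4: 0.82, 6.6: 0.92, 6.8: 0.98, 7.0: 1.0}
FAST = {6.0: 0.5, 6.2: 0.8, 6.4: 0.9, 6.6: 1.0, 6.8: 0.75, 7.0: 0.6}


def fuzzy_union(a, b):
    """Pointwise max of two fuzzy sets defined on the same universe."""
    return {u: max(a[u], b[u]) for u in a}


def fuzzy_intersection(a, b):
    """Pointwise min of two fuzzy sets defined on the same universe."""
    return {u: min(a[u], b[u]) for u in a}


def fuzzy_complement(a):
    """Pointwise 1 - grade (rounded only to hide floating-point noise)."""
    return {u: round(1.0 - grade, 10) for u, grade in a.items()}


tall_or_fast = fuzzy_union(TALL, FAST)
tall_and_fast = fuzzy_intersection(TALL, FAST)
short = fuzzy_complement(TALL)

print("TALL and FAST:", tall_and_fast)
print("SHORT:", short)
```

Other union and intersection operators, such as the Yager class mentioned above, would replace only the `max` and `min` calls.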


4.5 Fuzzy Rules, Inference 


4.5.1 Fuzzy Relation/Composition/Conditional Statement 


Another concept of fuzzy logic is the fuzzy relation. Fuzzy logic techniques can be used to translate natural 
language into heuristic responses and also to combine fuzzy membership functions to formulate fuzzy 
rules. A fuzzy relation R from a set X to a set Y is defined as a fuzzy subset of the Cartesian product $X \times Y$, which is the collection of ordered pairs (x, y), where $x \in X$ and $y \in Y$. In the same way that a membership function defines an element’s membership in a given set, the fuzzy relation is a fuzzy set composed of all combinations of participating sets, which determines the degree of association between two or more elements of distinct sets. It is characterized by a bivariate membership function $\mu_R(x, y)$, written as

$$R_Y(x) = \int_{X \times Y} \frac{\mu_R(x, y)}{(x, y)} \quad \text{in continuous membership function notation,} \qquad (4.15)$$

and

$$R_Y(x) = \sum_{i=1}^{n} \frac{\mu_R(x_i, y_i)}{(x_i, y_i)} \quad \text{in discrete membership function notation.} \qquad (4.16)$$

Let us consider a fuzzy rule: 
If a player is much taller than 6.4 feet, his scoring average per game should be high. (4.17)


Let the fuzzy sets be 


X = much taller than 6.4 ft (with [6 ft, 7 ft] as the universe of discourse)

$$= \frac{0}{6} + \frac{0}{6.2} + \frac{0.2}{6.4} + \frac{0.5}{6.6} + \frac{0.8}{6.8} + \frac{1}{7}, \qquad (4.18)$$

and

Y = scoring average per game (with [0, 20] as the universe of discourse)

$$= \frac{0}{0} + \frac{0.3}{5} + \frac{0.8}{10} + \frac{0.9}{15} + \frac{1}{20}. \qquad (4.19)$$


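If we use the min operator to form R (the “product” operator is another popular choice [12]), then the consequent relational matrix between X and Y is

$$R_Y(x) = \begin{array}{c|ccccc}
X \backslash Y & 0 & 5 & 10 & 15 & 20 \\ \hline
6 & 0 \wedge 0 & 0 \wedge 0.3 & 0 \wedge 0.8 & 0 \wedge 0.9 & 0 \wedge 1 \\
6.2 & 0 \wedge 0 & 0 \wedge 0.3 & 0 \wedge 0.8 & 0 \wedge 0.9 & 0 \wedge 1 \\
6.4 & 0.2 \wedge 0 & 0.2 \wedge 0.3 & 0.2 \wedge 0.8 & 0.2 \wedge 0.9 & 0.2 \wedge 1 \\
6.6 & 0.5 \wedge 0 & 0.5 \wedge 0.3 & 0.5 \wedge 0.8 & 0.5 \wedge 0.9 & 0.5 \wedge 1 \\
6.8 & 0.8 \wedge 0 & 0.8 \wedge 0.3 & 0.8 \wedge 0.8 & 0.8 \wedge 0.9 & 0.8 \wedge 1 \\
7 & 1 \wedge 0 & 1 \wedge 0.3 & 1 \wedge 0.8 & 1 \wedge 0.9 & 1 \wedge 1
\end{array}$$

$$= \begin{array}{c|ccccc}
X \backslash Y & 0 & 5 & 10 & 15 & 20 \\ \hline
6 & 0 & 0 & 0 & 0 & 0 \\
6.2 & 0 & 0 & 0 & 0 & 0 \\
6.4 & 0 & 0.2 & 0.2 & 0.2 & 0.2 \\
6.6 & 0 & 0.3 & 0.5 & 0.5 & 0.5 \\
6.8 & 0 & 0.3 & 0.8 & 0.8 & 0.8 \\
7 & 0 & 0.3 & 0.8 & 0.9 & 1
\end{array} \qquad (4.20)$$

Note that in (4.20), $\min R_Y(x) = 0$ and $\max R_Y(x) = 1$. Therefore, a player with a height of 6.8 ft and a scoring average of 10 per game has a fuzzy relation value of 0.8 on a scale between 0 and 1.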
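
The relational matrix can be reproduced with a short Python sketch. It assumes the two fuzzy sets of Equations 4.18 and 4.19 are stored as dictionaries and that the relation is formed with the min operator; the function name `relation_min` is hypothetical.

```python
# Fuzzy sets from Equations 4.18 and 4.19: element -> membership grade.
X = {6.0: 0.0, 6.2: 0.0, 6.4: 0.2, 6.6: 0.5, 6.8: 0.8, 7.0: 1.0}
Y = {0: 0.0, 5: 0.3, 10: 0.8, 15: 0.9, 20: 1.0}


def relation_min(x_set, y_set):
    """Fuzzy relation on X x Y built with the min operator."""
    return {
        (x, y): min(mu_x, mu_y)
        for x, mu_x in x_set.items()
        for y, mu_y in y_set.items()
    }


R = relation_min(X, Y)

# Print the relation row by row (one row per height).
for x in X:
    print(x, [R[(x, y)] for y in Y])
```

Swapping `min` for a product gives the alternative relation mentioned in the text.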


4.5.2 Compositional Rule of Inference 


The compositional rule of inference, which may be regarded as an extension of the familiar rule of modus ponens in classical propositional logic, is another important concept. Specifically, if R is a fuzzy relation from X to Y, and x is a fuzzy subset of X, then the fuzzy subset y of Y that is induced by x is given by the composition of R and x, in the sense of Equation 4.12. Also, if R is a relation from X to Y and S is a relation from Y to Z, then the composition of R and S is a fuzzy relation denoted by $R \circ S$, defined by

$$R \circ S = \int_{X \times Z} \frac{\max_{y}\{\min(\mu_R(x, y), \mu_S(y, z))\}}{(x, z)}. \qquad (4.21)$$

Equation 4.21 is called the max-min composition of R and S.
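
As a minimal sketch of the max-min composition, the following Python function composes two discrete relations stored as dictionaries keyed by element pairs. The relations `R` and `S` below are small made-up examples, not data from this chapter.

```python
def max_min_composition(R, S):
    """Max-min composition of R (on X x Y) and S (on Y x Z).

    Both relations map an element pair to a membership grade.
    """
    xs = sorted({x for x, _ in R})
    ys = sorted({y for _, y in R})
    zs = sorted({z for _, z in S})
    return {
        (x, z): max(min(R[(x, y)], S[(y, z)]) for y in ys)
        for x in xs
        for z in zs
    }


# Hypothetical relations used only to exercise the function.
R = {("x1", "y1"): 0.2, ("x1", "y2"): 0.9,
     ("x2", "y1"): 0.7, ("x2", "y2"): 0.4}
S = {("y1", "z1"): 0.5, ("y1", "z2"): 1.0,
     ("y2", "z1"): 0.6, ("y2", "z2"): 0.3}

RS = max_min_composition(R, S)
print(RS)
```

For each pair (x, z), the inner `min` measures how strongly x connects to z through one intermediate y, and the outer `max` keeps the strongest such path.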
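For example, if a player is about 6.8 ft with membership function

$$x = \frac{0}{6} + \frac{0}{6.2} + \frac{0}{6.4} + \frac{0.2}{6.6} + \frac{1}{6.8} + \frac{0.2}{7}, \qquad (4.22)$$

and we are interested in his corresponding membership function of scoring average per game, then

$$\begin{aligned}
y = x \circ R_Y(x) &= \begin{bmatrix} 0 & 0 & 0 & 0.2 & 1 & 0.2 \end{bmatrix} \circ
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0.2 & 0.2 & 0.2 & 0.2 \\
0 & 0.3 & 0.5 & 0.5 & 0.5 \\
0 & 0.3 & 0.8 & 0.8 & 0.8 \\
0 & 0.3 & 0.8 & 0.9 & 1
\end{bmatrix} \\
&= \max_{y_i}
\begin{bmatrix}
0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 \\
0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 & 0 \wedge 0 \\
0 \wedge 0 & 0 \wedge 0.2 & 0 \wedge 0.2 & 0 \wedge 0.2 & 0 \wedge 0.2 \\
0.2 \wedge 0 & 0.2 \wedge 0.3 & 0.2 \wedge 0.5 & 0.2 \wedge 0.5 & 0.2 \wedge 0.5 \\
1 \wedge 0 & 1 \wedge 0.3 & 1 \wedge 0.8 & 1 \wedge 0.8 & 1 \wedge 0.8 \\
0.2 \wedge 0 & 0.2 \wedge 0.3 & 0.2 \wedge 0.8 & 0.2 \wedge 0.9 & 0.2 \wedge 1
\end{bmatrix} \\
&= \max_{y_i}
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0.2 & 0.2 & 0.2 & 0.2 \\
0 & 0.3 & 0.8 & 0.8 & 0.8 \\
0 & 0.2 & 0.2 & 0.2 & 0.2
\end{bmatrix}
=
\begin{bmatrix}
0 \vee 0 \vee 0 \vee 0 \vee 0 \\
0 \vee 0 \vee 0 \vee 0 \vee 0 \\
0 \vee 0 \vee 0 \vee 0 \vee 0 \\
0 \vee 0.2 \vee 0.2 \vee 0.2 \vee 0.2 \\
0 \vee 0.3 \vee 0.8 \vee 0.8 \vee 0.8 \\
0 \vee 0.2 \vee 0.2 \vee 0.2 \vee 0.2
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0.2 \\ 0.8 \\ 0.2 \end{bmatrix}.
\end{aligned} \qquad (4.23)$$

4.5.3 Defuzzification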


After a fuzzy rule base has been formed (such as the one shown in Equation 4.17) along with the par- 
ticipating membership functions (such as the ones shown in Equations 4.22 and 4.23), a defuzzification 
strategy needs to be implemented. The purpose of defuzzification is to convert the results of the rules 
and membership functions into usable values, whether it be a specific result or a control input. There 
are a few techniques for the defuzzification process. One popular technique is the “center-of-gravity” 
method, which is capable of considering the influences of many different effects simultaneously. The 
general form of the center of gravity method is 


$$f_j = \frac{\sum_{i=1}^{N} \mu_i c_i}{\sum_{i=1}^{N} \mu_i}, \qquad (4.24)$$


where
N is the number of rules under consideration
$\mu_i$ and $c_i$ are the membership and the control action associated with rule i, respectively
$f_j$ represents the jth control output


Consider the rule base 


1. if X = tall and Y = fast then C = 1.
2. if X = short and Y = slow then C = 0. (4.25)

The “center-of-gravity” method applied to this rule base with N = 2 results in

$$f_1 = \frac{\min(\mu_{TALL}(X), \mu_{FAST}(Y)) \times 1 + \min(\mu_{SHORT}(X), \mu_{SLOW}(Y)) \times 0}{\min(\mu_{TALL}(X), \mu_{FAST}(Y)) + \min(\mu_{SHORT}(X), \mu_{SLOW}(Y))}, \qquad (4.26)$$


where the “min” operator has been used in association with the “and” operation. We will discuss more 
about the details of the defuzzification process in later sections. 
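
A small numerical sketch may help make the center-of-gravity computation concrete. The membership grades below are hypothetical values chosen only for illustration, and the function name is an assumption; the rule base is the two-rule example given above, with min used for “and.”

```python
def center_of_gravity(strengths, actions):
    """Weighted average of rule actions, weighted by rule firing strengths."""
    total = sum(strengths)
    if total == 0:
        raise ValueError("no rule fired; the output is undefined")
    return sum(mu * c for mu, c in zip(strengths, actions)) / total


# Hypothetical membership grades for one player.
mu_tall, mu_fast = 0.8, 0.6
mu_short, mu_slow = 0.2, 0.4

# "and" is realized with the min operator.
w1 = min(mu_tall, mu_fast)    # rule 1: tall and fast  -> C = 1
w2 = min(mu_short, mu_slow)   # rule 2: short and slow -> C = 0

f1 = center_of_gravity([w1, w2], [1.0, 0.0])
print(f1)
```

With these grades, rule 1 fires with strength 0.6 and rule 2 with strength 0.2, so the output is pulled three quarters of the way toward C = 1.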


4.6 Fuzzy Control 


The concepts outlined above represent the basic foundation upon which fuzzy control is built. In fact, 
the potential of fuzzy logic in control systems was shown very early by Mamdani and his colleagues [13]. 
Since then, fuzzy logic has been used successfully in a variety of control applications. Since the heuris- 
tic knowledge about how to control a given plant is often in the form of linguistic rules provided by a 
human expert or operator, fuzzy logic provides an effective means of translating that knowledge into an 
actual control signal. These rules are usually of the form 


IF (a set of conditions is satisfied) THEN (the adequate control action is taken), 


where the conditions to be satisfied are the antecedents and the control actions are the consequent of 
the fuzzy control rules, both of which are associated with fuzzy concepts (linguistic variables). Several 
linguistic variables may be involved in the antecedents or the consequents of a fuzzy rule, depending 
on how many variables are involved in the control problem. For example, let x and y represent two important state variables of a process, and let w and z be the two control variables for the process. In this case, fuzzy control rules have the form


Rule 1: if x is $A_1$ and y is $B_1$ then w is $C_1$ and z is $D_1$
Rule 2: if x is $A_2$ and y is $B_2$ then w is $C_2$ and z is $D_2$
⋮
Rule n: if x is $A_n$ and y is $B_n$ then w is $C_n$ and z is $D_n$


where $A_i$, $B_i$, $C_i$, and $D_i$ are the linguistic values (fuzzy sets) of x, y, w, and z, in the universes of discourse X, Y, W, and Z, respectively, with i = 1, 2, ..., n. Using the concepts of fuzzy conditional statements and the compositional rule of inference, w and z (fuzzy subsets of W and Z, respectively) can be inferred from each fuzzy control rule.

Typically, though, in control problems, the values of the state variables and of the control signals are represented by real numbers, not fuzzy sets. Therefore, it is necessary to convert real information into fuzzy sets and, vice versa, to convert fuzzy sets into real numbers. These two conversion processes are generally called fuzzification and defuzzification, respectively. Specifically, a fuzzification operator has the effect of transforming crisp data into fuzzy sets. Symbolically,


$$x = \text{fuzzifier}(x_0), \qquad (4.27)$$


where
$x_0$ is a crisp value of a state variable
x is a fuzzy set


Conversely, a defuzzification operator transforms the outputs of the inference process (fuzzy sets) into a crisp value for the control action. That is,


$$z_0 = \text{defuzzifier}(z), \qquad (4.28)$$


where
$z_0$ is a crisp value
z is a fuzzy membership value


Referring back to the height example used previously, if a basketball player is about 6.8 ft, then after 
fuzzification, his height membership value is shown in Equation 4.22. Various fuzzification and defuzzi- 
fication techniques are described in the references. Figure 4.9 is a block diagram representation of the 
mechanism of the fuzzy control described above. 


FIGURE 4.9 Block diagram representation of the concept of fuzzy logic control. (Blocks: fuzzifier, inference process, defuzzifier, and system’s dynamics; signals: control action and state variables.)


4.7 Fuzzy Control Design


4.7.1 DC Motor Dynamics—Assume a Linear Time-Invariant System 


Due to the popularity of dc motors for control applications, a dc motor velocity control will be used to 
illustrate the fuzzy control design approach. Readers are assumed to have a basic background about dc 
motor operations; otherwise, please refer to [14,15]. The fuzzy control will be applied to an actual 
dc motor system to demonstrate the effectiveness of the control techniques. We will briefly describe 
the actual control system setup below. 

The fuzzy controller is implemented on a 486 PC using the LabVIEW graphical programming 
package. The complete system setup is shown in Figure 4.10 and the actual motor control system is 
given in Figure 4.11. The rotation of the motor shaft generates a tachometer voltage, which is then 
scaled by interfacing electronic circuitry. A National Instruments data-acquisition board receives the 
data via an Analog Devices isolation backplane. After a control value is computed, an output current 
is generated by the data acquisition board. The current signal passes through the backplane and is 
then converted to a voltage signal and scaled by the interfacing circuitry before being applied to the 
armature of the motor. Load disturbances are generated by subjecting a disc on the motor shaft to a 
magnetic field. 

For illustration purposes, the control objective concentrates on achieving zero steady-state error 
and smooth, fast response to step inputs. These are popular desired motor performance characteris- 
tics for many industrial applications. The parameters and their numerical values of the dc servomotor 
used for our simulation studies are listed in Table 4.1, obtained by conventional system identification 
techniques. 

The parameters $R_a$ and $L_a$ are the resistance and the inductance of the motor armature circuit, respectively; J and f are the moment of inertia and the viscous-friction coefficient of the motor and load (referred to the motor shaft), respectively; K is the constant relating the armature current to the motor torque, and $K_b$ is the constant relating the motor speed to the dc motor’s back emf. The dc motor has an input operating range of [-15, 15] volts.


FIGURE 4.10 Schematic diagram of the experimental dc motor system. (Labeled elements include the interfacing circuitry and the magnet used for disturbance generation.)


FIGURE 4.11 Motor control system setup.


TABLE 4.1 DC Motor Parameters

Parameter   Value
R_a         [illegible]
L_a         170e-3 H
J           42.6e-6 kg m^2
f           47.3e-6 N m/rad/s
K           14.7e-3 N m/A
K_b         14.7e-3 V s/rad


4.7.2 Fuzzy Control

4.7.2.1 Initial Fuzzy Rules and Membership Function Design

In order to design a fuzzy controller, we will first need to determine the inputs, outputs, universe of discourse, membership functions, and fuzzy rules. In this section, the input variables of the fuzzy controller are the error (E = e(k)), which is the difference between the dc motor speed and the reference speed, and the change in error (CE = e(k) - e(k - 1)), where k is the time index. The output variable of the controller is the change in the control effort (CU = u(k) - u(k - 1)).

The determination of the universe of discourse of the velocity error change and the control effort change is based on experience and knowledge about the dc motor. For example, the simulation results of the dc motor performance for different control laws based on estimated motor parameters are very helpful for the design of the fuzzy controller. The simulation results will give a rough idea about the response of the system, even though they do not give the exact system performance. Since the open-loop simulations of the system result in a possible velocity range of -500 to 500 rad/s, the minimum and maximum possible values that the error can assume are -1000 and 1000 rad/s, respectively. Hence, the universe of discourse (operating range) of the velocity error spans between -1000 and 1000 rad/s. Based on these requirements, the maximum value of error change is then set to 5.5 rad/s. Also, the maximum value for the control effort change is determined to be 1.5 V. The universes of discourse of the fuzzy variables are then partitioned into seven quantization levels (fuzzy sets), each being described by a linguistic statement such as “big,” “small,” etc., as listed in Table 4.2. The number of partition levels chosen is a trade-off between the resolution of the quantization and the complexity of the design problem, and is often dependent on the designer’s preference.


TABLE 4.2 Fuzzy Sets

PB   Positive big
PM   Positive medium
PS   Positive small
ZE   Zero
NS   Negative small
NM   Negative medium
NB   Negative big

A fuzzy membership function requires assigning a real number in the interval [0,1] to every element in the universe of discourse. This number indicates the degree to which the element belongs to a fuzzy set, such as big or small. Fuzzy membership functions can have different shapes depending on the designer’s preference and/or experience. Triangular and trapezoidal shapes are popular because they are simple to compute and capture the designer’s intuitive sense of fuzzy numbers. Again, the choice of membership functions is a subjective matter, but prior experience can provide some useful guidelines. For example, if the measurable data is disturbed by noise, then the membership functions should be sufficiently wide to reduce noise sensitivity [23]. Some researchers and engineers also suggest that adjacent fuzzy-set values should overlap approximately 25%, and fine-tuning can be achieved by altering this overlap percentage.

Figure 4.12 shows the initial membership functions, which assign a real number in the interval [0,1] to 
every element in the universe of discourse used for the motor control problem. This number indicates the 
degree to which the element belongs to a fuzzy set, such as big or small, used in the fuzzy velocity controller. 
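
The idea of partitioning a universe of discourse into seven fuzzy sets can be sketched as follows. This is an illustration under stated assumptions: the sets are taken to be identical, evenly spaced triangles that cross at a grade of 0.5, which need not match the shapes or the overlap of Figure 4.12; the function names are hypothetical.

```python
LABELS = ["NB", "NM", "NS", "ZE", "PS", "PM", "PB"]


def triangle(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(value, limit):
    """Grades of a crisp value in seven sets spanning [-limit, limit]."""
    step = limit / 3.0                      # distance between adjacent peaks
    value = max(-limit, min(limit, value))  # saturate at the universe edges
    grades = {}
    for k, label in enumerate(LABELS):
        peak = (k - 3) * step
        grades[label] = triangle(value, peak - step, peak, peak + step)
    return grades


# Example: a velocity error of 500 rad/s on the [-1000, 1000] universe.
print(fuzzify(500.0, 1000.0))
```

With this layout every crisp value belongs to at most two adjacent sets, so at most four of the 49 rules fire at any operating point.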

Notice that there is a rule for every possible combination of E and CE that may arise. Since E and CE 
both are partitioned into 7 fuzzy sets, the fuzzy rule base table thus has a total of 49 entries, each cor- 
responding to a different combination of input fuzzy set values. These rules have the form 

rule i: if E = $A_{E_i}$ and CE = $A_{CE_i}$ then CU = $C_i$, (4.29)

where $A_{E_i}$ and $A_{CE_i}$ are the fuzzy set values of the antecedent part of rule i for E and CE, respectively. Likewise, $C_i$ is the fuzzy set value of the consequent part of rule i for CU.

FIGURE 4.12 Initial membership functions used in the fuzzy velocity controller.

The fuzzification process, which is the transformation of crisp inputs to fuzzy set outputs, is accomplished by using the popular correlation-product inference method. By the same token, these fuzzy set outputs were defuzzified with a centroid computation to generate an exact numerical output as stated in Equation 4.24. With this method, the motor’s current operating point numerical values $E^0$ and $CE^0$ are required. The inference method can be described mathematically as


$$l_i = \min\{\mu_{A_{E_i}}(E^0), \mu_{A_{CE_i}}(CE^0)\}, \qquad (4.30)$$


which gives the influencing factor of rule i on the decision-making process, and


$$I_i = \int \mu_{C_i}(CU)\, dCU, \qquad (4.31)$$


gives the area bounded by the membership function $\mu_{C_i}(CU)$; thus $l_i I_i$ gives the area bounded by the membership function $\mu_{C_i}(CU)$ scaled by $l_i$ computed in Equation 4.30.
The centroid of the area bounded by $\mu_{C_i}(CU)$ is computed as


$$c_i = \frac{\int CU\, \mu_{C_i}(CU)\, dCU}{\int \mu_{C_i}(CU)\, dCU}, \qquad (4.32)$$


thus $l_i I_i c_i$ gives the control value contributed by rule i. The control value $CU^0$, which combines the control efforts from all N rules, is then computed as


$$CU^0 = \frac{\sum_{i=1}^{N} l_i I_i c_i}{\sum_{i=1}^{N} l_i I_i}. \qquad (4.33)$$


In Equations 4.30 through 4.33, the subscript i indicates the ith rule of a set of N rules.
For illustration purposes, assume the motor is currently at the operating point $E^0$ and $CE^0$ and assume only two rules are used (thus N = 2)


if E is PS and CE is ZE then CU = PS, 


and 


if E is ZE and CE is PS then CU = ZE. (4.34)


FIGURE 4.13 Graphical representation of the correlation-product inference method: (a) membership function of E, (b) membership function of CE, and (c) membership function of CU.


FIGURE 4.14 Graphical representation of the center-of-gravity defuzzification method.


The correlation-product inference method described in (4.30) and (4.31) and the area bounded by the inferred membership function $\mu_{C_i}(CU)$ are conceptually depicted in Figure 4.13.

By looking at rule 1, the membership value of E for PS, $\mu_{PS}(E^0)$, is larger than the membership value of CE for ZE, $\mu_{ZE}(CE^0)$; therefore


$$l_1 = \mu_{ZE}(CE^0).$$


$I_1$ is the area bounded by the membership function PS on CU (the hatched and the shaded areas) in Figure 4.13c and $l_1 I_1$ is only the hatched area. $c_1$ is computed to give the centroid of $I_1$. The same arguments also apply to rule 2. The defuzzification process is graphically depicted in Figure 4.14.

The scaled control membership functions (the hatched areas) from different rules (Figure 4.14a) are 
combined together from all rules, which form the hatched area shown in Figure 4.14b. The centroid 
value of the combined hatched area is then computed, to give the final crisp control value. 
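
The two-rule example can be traced numerically with the sketch below. It assumes triangular membership functions on normalized, made-up universes (the real ones are those of Figure 4.12), uses the closed-form area and centroid of a triangle for Equations 4.31 and 4.32, and combines the scaled areas with a weighted centroid as in center-of-gravity defuzzification. All set parameters and names are hypothetical.

```python
def tri(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def tri_area(left, peak, right):
    """Area under a unit-height triangle (Equation 4.31 in closed form)."""
    return 0.5 * (right - left)


def tri_centroid(left, peak, right):
    """Centroid of a triangle along the CU axis (Equation 4.32 in closed form)."""
    return (left + peak + right) / 3.0


# Hypothetical normalized fuzzy sets: label -> (left, peak, right).
E_SETS = {"ZE": (-1.0, 0.0, 1.0), "PS": (0.0, 1.0, 2.0)}
CE_SETS = {"ZE": (-1.0, 0.0, 1.0), "PS": (0.0, 1.0, 2.0)}
CU_SETS = {"ZE": (-0.5, 0.0, 0.5), "PS": (0.0, 0.5, 1.0)}

# The two rules of Equation 4.34: (E label, CE label, CU label).
RULES = [("PS", "ZE", "PS"), ("ZE", "PS", "ZE")]


def infer(e0, ce0):
    """Crisp control change for the operating point (e0, ce0)."""
    num = den = 0.0
    for e_lbl, ce_lbl, cu_lbl in RULES:
        l = min(tri(e0, *E_SETS[e_lbl]), tri(ce0, *CE_SETS[ce_lbl]))  # (4.30)
        area = tri_area(*CU_SETS[cu_lbl])                             # (4.31)
        c = tri_centroid(*CU_SETS[cu_lbl])                            # (4.32)
        num += l * area * c
        den += l * area
    return num / den if den else 0.0                                  # (4.33)


print(infer(0.8, 0.4))
```

At the operating point (0.8, 0.4), rule 1 fires with strength min(0.8, 0.6) = 0.6 and rule 2 with min(0.2, 0.4) = 0.2, so the output lies three quarters of the way from the ZE centroid toward the PS centroid.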


4.7.2.2 PI Controller 


If only experience and control engineering knowledge are used to derive the fuzzy rules, the designers will probably be overwhelmed by the degrees of freedom (number of rules) of the design, and many of the rule table entries may be left empty because insufficient detailed knowledge can be extracted from the expert. To make the design process more effective, it is helpful to have a structured procedure for the designer to follow in order to eventually develop a proper fuzzy controller.

FIGURE 4.15 PI control surface.

In this section, we will illustrate how to take advantage of conventional control knowledge to arrive 
at a fuzzy control design more effectively [16]. The PI controller is one of the most popular conventional 
controllers and will be used in this section as the technique to incorporate a priori knowledge, which 
will eventually lead to the fuzzy controller. 

The velocity transfer function of the dc motor control can be easily obtained from many textbooks [17]. It can be derived as


$$G(s) = \frac{\omega(s)}{e_a(s)} = \frac{K}{J L_a s^2 + (f L_a + J R_a) s + (f R_a + K K_b)}. \qquad (4.35)$$

The general equation for a PI controller is

$$u(k) = u(k-1) + \left(K_p + \frac{K_i T}{2}\right) e(k) + \left(\frac{K_i T}{2} - K_p\right) e(k-1), \qquad (4.36)$$


where T is the sampling period, and $K_p$ and $K_i$ can be determined by the root-locus method [17]. For velocity control, $K_i = 0.264$ and $K_p = 0.12$ are chosen to yield desirable response characteristics, which give an adequate tradeoff between the speed of the response and the percentage of overshoot. The PI control surface over the universes of discourse of error (E) and error change (CE) is shown in Figure 4.15.
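
A minimal sketch of an incremental PI update is given below. It assumes the trapezoidal (bilinear) form of the discrete PI law, in which the integral gain is multiplied by half the sampling period; the sampling period `T` is a hypothetical value, the gains are the ones quoted above, and the output is clamped to the motor input range of [-15, 15] V stated earlier.

```python
KP = 0.12    # proportional gain quoted in the text
KI = 0.264   # integral gain quoted in the text
T = 0.01     # sampling period in seconds (hypothetical value)
U_MIN, U_MAX = -15.0, 15.0  # motor input operating range in volts


def pi_step(u_prev, e, e_prev):
    """One incremental PI update, clamped to the actuator range."""
    u = u_prev + (KP + KI * T / 2.0) * e + (KI * T / 2.0 - KP) * e_prev
    return max(U_MIN, min(U_MAX, u))


# A constant error of 1 rad/s: the first step applies the proportional kick,
# and later steps only accumulate the integral contribution KI * T * e.
u1 = pi_step(0.0, 1.0, 0.0)
u2 = pi_step(u1, 1.0, 1.0)
print(u1, u2)
```

Because the update depends only on e(k) and e(k - 1), the control change CU = u(k) - u(k - 1) is a linear function of E and CE, which is why the PI law appears as a plane over the (E, CE) universe.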


4.7.2.3 Borrowing PI Knowledge 


The PI control surface is then taken as a starting point for the fuzzy control surface. More specifically, a 
“fuzzification” of the PI control surface yielded the first fuzzy control surface to be further tuned. This 
starting point corresponds to the upper left surface of Figure 4.16 (the membership functions are identi- 
cal in shape and size, and symmetric about ZE). 

The initial fuzzy rule base table (Table 4.3) was specified by “borrowing” values from the PI control 
surface. Since the controller uses the information (E, CE) in order to produce a control signal (CU), the 
control action can be completely defined by a three-dimensional control surface. Any small modification to a controller will appear as a change in its control surface.


FIGURE 4.16 Effects of fine-tuning on the fuzzy control surface. (One direction of the panel grid corresponds to making “finer” membership functions, the other to increasing “severity” of rules.)


TABLE 4.3 Initial Fuzzy Rule Base Table

CE \ E | NB   NM   NS   ZE   PS   PM   PB
PB     | ZE   PS   PM   PB   PB   PB   PB
PM     | NS   ZE   PS   PM   PB   PB   PB
PS     | NM   NS   ZE   PS   PM   PB   PB
ZE     | NB   NM   NS   ZE   PS   PM   PB
NS     | NB   NB   NM   NS   ZE   PS   PM
NM     | NB   NB   NB   NM   NS   ZE   PS
NB     | NB   NB   NB   NB   NM   NS   ZE


For example, the rule of the first row and third column of Table 4.3 (highlighted) corresponds to the 
statement 


if CE = PB and E = NS then CU = PM, 


which indicates that if the error is large and is gradually decreasing, then the controller should produce 
a positive medium-compensating signal. 

The initial fuzzy controller obtained will give a performance similar to the designed PI controller. The controller performance can be improved by fine-tuning the fuzzy controller while control signals are being applied to the actual motor. In order to fine-tune the fuzzy controller, two parameters can be adjusted: the membership functions and the fuzzy rules. Both the shape of membership functions and the severity of fuzzy rules can affect the motor performance. In general, making the membership functions “narrow” near the ZE region and “wider” far from the ZE region can improve the controller’s resolution in the proximity of the desired response when the system output is close to the reference values, thus improving the tracking performance. Also, performance can be improved by changing the “severity” of the rules, which amounts to modifying their consequent part. Figure 4.16 shows the changes in the fuzzy control surface brought about by varying the membership functions and the fuzzy rules. The fine-tuning process begins with Figure 4.16-1. The initial control surface is similar to the PI control, which was used as a starting guideline for the fuzzy controller. The changes in control surfaces from the left-hand side to the right-hand side signify the change of the shape of membership functions. The changes in control surfaces from top to bottom signify the change in rules.

FIGURE 4.17 Intermediate membership functions used in the fuzzy velocity controller.

In order to demonstrate the effect of fine-tuning the membership functions, we show a set of interme- 
diate membership functions (relative to the initial one shown in Figure 4.12) in Figure 4.17 and the final 
membership functions in Figure 4.18. Figures 4.17 and 4.18 show that some fuzzy sets, such as the ZE 
in CU, are getting narrower, which allows finer control in the proximity of the desired response, while 
the wider fuzzy sets, such as the PB in E, permit coarse but fast control far from the desired response. 

We also show an intermediate rule table and the final rule table during the fine-tuning process. 
The rule in the first row and third column (highlighted cell) corresponds to 


if CE = PB and E = NS, then CU = To be determined. 


In the initial rule table (Table 4.3), CU = PM. However, during the fine-tuning process, we found that the rule should have a different action in order to give better performance. In the intermediate rule table (Table 4.4), CU becomes ZE, and in the final fuzzy rule table (Table 4.5), CU = NS.


FIGURE 4.18 Final membership functions used in the fuzzy velocity controller. (Panels: E, CE, and CU, with axis limits of 1000, 5.5, and 1.5, respectively.)


TABLE 4.4 Intermediate Fuzzy Rule Base Table

CE \ E | NB   NM   NS   ZE   PS   PM   PB
PB     | NS   NS   ZE   PB   PB   PB   PB
PM     | NM   NS   ZE   PB   PB   PB   PB
PS     | NM   NS   NS   PS   PM   PB   PB
ZE     | NB   NM   NS   ZE   PS   PM   PB
NS     | NB   NB   NM   NS   PS   PS   PM
NM     | NB   NB   NB   NB   ZE   PS   PM
NB     | NB   NB   NB   NB   ZE   PS   PS


TABLE 4.5 Final Fuzzy Rule Base Table

CE \ E | NB   NM   NS   ZE   PS   PM   PB
PB     | NM   NS   NS   PB   PB   PB   PB
PM     | NM   NM   NS   PB   PB   PB   PB
PS     | NB   NM   NM   PS   PB   PB   PB
ZE     | NB   NB   NM   ZE   PM   PB   PB
NS     | NB   NB   NB   NS   PM   PM   PB
NM     | NB   NB   NB   NB   PS   PM   PM
NB     | NB   NB   NB   NB   PS   PS   PM


From Figure 4.16-1 through Figure 4.16-9 it can be seen that gradually increasing the “fineness” of the membership functions and the “severity” of the rules can bring the fuzzy controller to its best performance level. The fine-tuning process is not difficult. The fuzzy control designer can easily get a feel for how to perform the correct fine-tuning after a few trials. Figure 4.16-9 exhibits the fuzzy control surface that yielded the best results. The membership functions and fuzzy rules that generated it are the ones of Figure 4.18 and Table 4.5, respectively.

The performance of the controllers for the dc motor velocity control is shown in Figure 4.19 for two different references. The fuzzy controller exhibits better performance than the PI controller because of its shorter rise time and settling time. The fuzzy controller was thoroughly fine-tuned to yield the best performance.


FIGURE 4.19 Velocity control. Two references are shown. (Vertical axis: angular velocity in rad/s.)


4.8 Conclusion and Future Direction


This chapter outlines the design procedures used in the design of fuzzy controllers for the velocity control of a dc motor. A comparison of these controllers with a classical PI controller was discussed in terms of the characteristics of the respective control surfaces. The PI control was shown to behave as a special case of the fuzzy controller, and for this reason it is more constrained and less flexible than the fuzzy controller. A methodology that exploits the advantages of each technique in order to achieve a successful design is seen as the most sensible approach to follow. The drawback of the fuzzy controller is that it cannot adapt. Recently, different researchers, including the author of this chapter, have been investigating adaptive fuzzy controllers. Many of these efforts implement the fuzzy controller in a neural network structure for adaptation, which is convenient and computationally fast. In fact, a very interesting area of exploration is the possibility of combining the advantages of fuzzy logic with those of artificial neural networks (ANN). The possibility of using the well-known learning capabilities of an ANN coupled with the ability of a fuzzy logic system to translate heuristic knowledge and fuzzy concepts into real numerical values may represent a very powerful way of coming closer to intelligent, adaptive control systems.

5.1 Introduction


The fascination with artificial neural networks started in the middle of the previous century. The first artificial neurons were proposed by McCulloch and Pitts [MP43], who showed the power of threshold logic. Later, Hebb [H49] introduced his learning rules. A decade later, Rosenblatt [R58] introduced the perceptron concept. In the early 1960s, Widrow and Hoff [WH60] developed intelligent systems such as ADALINE and MADALINE. Nilsson [N65], in his book Learning Machines, summarized many developments of that time. The publication of the Minsky and Papert [MP69] book, with some discouraging results, stopped the fascination with artificial neural networks for some time, and achievements in the mathematical foundation of the backpropagation algorithm by Werbos [W74] went unnoticed. The current rapid growth in the area of neural networks started with Hopfield’s [H82] recurrent network, Kohonen’s [K90] unsupervised training algorithms, and a description of the backpropagation algorithm by Rumelhart et al. [RHW86]. Neural networks are now used to solve many engineering, medical, and business problems [WK00,WB01,B07,CCBC07,KTP07,KT07,MFP07,FP08,JM08,W09]. Descriptions of neural network technology can be found in many textbooks [W89,Z92,H99,W96].


5.2 The Neuron 


A biological neuron is a complicated structure, which receives trains of pulses on hundreds of excitatory and inhibitory inputs. Those incoming pulses are summed with different weights (averaged) over a time period [WPJ96]. If the summed value is higher than a threshold, then the neuron itself generates a pulse, which is sent to neighboring neurons. Because incoming pulses are summed over time, the neuron generates a pulse train with a higher frequency for higher positive excitation. In other words, if the value of the summed weighted inputs is higher, the neuron generates pulses more frequently. At the same time, each neuron is characterized by nonexcitability for a certain time after the firing pulse. This so-called refractory period can be more accurately described as a phenomenon where, after excitation, the threshold value increases to a very high value and then decreases gradually with a certain time constant. The refractory period sets soft upper limits on the frequency of the output pulse train. In the biological neuron, information is sent in the form of frequency-modulated pulse trains.

FIGURE 5.1 Examples of logical operations using McCulloch-Pitts neurons.


The description of neuron action leads to a very complex neuron model, which is not practical. 
McCulloch and Pitts [MP43] show that even with a very simple neuron model, it is possible to build logic 
and memory circuits. Examples of McCulloch-Pitts’ neurons realizing OR, AND, NOT, and MEMORY 
operations are shown in Figure 5.1. 

Furthermore, these simple neurons with thresholds are usually more powerful than typical logic 
gates used in computers (Figure 5.1). Note that the structure of OR and AND gates can be identical. 
With the same structure, other logic functions can be realized, as shown in Figure 5.2. 

The McCulloch-Pitts neuron model (Figure 5.3a) assumes that incoming and outgoing signals may have only binary values 0 and 1. If incoming signals summed through positive or negative weights have a value equal to or larger than the threshold, then the neuron output is set to 1. Otherwise, it is set to 0.


$$OUT = \begin{cases} 1 & \text{if } net \ge T \\ 0 & \text{if } net < T \end{cases} \qquad (5.1)$$


where
T is the threshold
net value is the weighted sum of all incoming signals (Figure 5.3)
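
The threshold neuron is easy to sketch in Python. The example below assumes three inputs with unit weights and uses threshold values of 0.5, 1.5, and 2.5, which are illustrative assumptions, to show how the same structure yields OR, majority (AB + BC + CA), and AND. A second hypothetical helper shows the equivalent neuron with zero threshold and an extra weight of -T on a constant +1 input.

```python
from itertools import product


def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: 1 when the weighted sum reaches the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0


def mp_neuron_bias(inputs, weights, threshold):
    """Same neuron with threshold 0 and weight -threshold on a constant +1 input."""
    extended_weights = list(weights) + [-threshold]
    extended_inputs = list(inputs) + [1]
    net = sum(w * x for w, x in zip(extended_weights, extended_inputs))
    return 1 if net >= 0 else 0


WEIGHTS = [1, 1, 1]
THRESHOLDS = {"OR": 0.5, "MAJORITY": 1.5, "AND": 2.5}  # assumed values

for name, t in THRESHOLDS.items():
    table = [mp_neuron(bits, WEIGHTS, t) for bits in product((0, 1), repeat=3)]
    print(name, table)
```

Only the threshold changes between the three logic functions; the weights and the wiring stay the same.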


FIGURE 5.2 The same neuron structure and the same weights, but a threshold change results in different logical functions (A + B + C, AB + BC + CA, and ABC).


FIGURE 5.3 Threshold implementation with an additional weight and constant input with +1 value: (a) neuron with threshold T, where $net = \sum_{i=1}^{n} w_i x_i$, and (b) modified neuron with threshold T = 0 and additional weight $w_{n+1} = -T$, where $net = \sum_{i=1}^{n} w_i x_i + w_{n+1}$.

FIGURE 5.4 ADALINE and MADALINE perceptron architectures.


The perceptron model has a similar structure (Figure 5.3b). Its input signals, weights, and thresholds can have any positive or negative values. Usually, instead of using a variable threshold, one additional constant input with a negative or positive weight is added to each neuron, as Figure 5.3 shows. Single-layer perceptrons are successfully used to solve many pattern classification problems. The best-known perceptron architectures are ADALINE and MADALINE [WH60], shown in Figure 5.4.

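Perceptrons using hard threshold activation functions are described, for unipolar neurons, by

$$o = f_{uni}(net) = \frac{\mathrm{sgn}(net) + 1}{2} = \begin{cases} 1 & \text{if } net \ge 0 \\ 0 & \text{if } net < 0 \end{cases} \qquad (5.2)$$

and for bipolar neurons by

$$o = f_{bip}(net) = \mathrm{sgn}(net) = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases} \qquad (5.3)$$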


For these types of neurons, most of the known training algorithms are able to adjust weights only in 
single-layer networks. Multilayer neural networks (as shown in Figure 5.8) usually use soft activation 
functions, either unipolar 


$$o = f_{uni}(net) = \frac{1}{1 + \exp(-\lambda\, net)} \qquad (5.4)$$


or bipolar


$$o = f_{bip}(net) = \tanh(0.5\,\lambda\, net) = \frac{2}{1 + \exp(-\lambda\, net)} - 1 \qquad (5.5)$$


These soft activation functions allow for the gradient-based training of multilayer networks. Soft
activation functions make the neural network transparent for training [WT93]. In other words, changes
in weight values always produce changes on the network outputs. This would not be possible when hard
activation functions are used. Typical activation functions are shown in Figure 5.5. Note that even
neuron models with continuous activation functions are far from an actual biological neuron, which
operates with frequency-modulated pulse trains [WJPM96].

FIGURE 5.5 Typical activation functions: hard in upper row and soft in the lower row.
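
As a rough illustration of the four activation functions discussed above, the following sketch implements the hard and soft, unipolar and bipolar variants. The function names and the `gain` parameter (standing for the gain factor in the exponent) are naming assumptions made for this example only.

```python
import math


def f_uni_hard(net):
    """Hard unipolar activation: output is 0 or 1."""
    return 1.0 if net >= 0 else 0.0


def f_bip_hard(net):
    """Hard bipolar activation: output is -1 or +1."""
    return 1.0 if net >= 0 else -1.0


def f_uni_soft(net, gain=1.0):
    """Soft unipolar (sigmoidal) activation with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gain * net))


def f_bip_soft(net, gain=1.0):
    """Soft bipolar activation with outputs in (-1, 1)."""
    return math.tanh(0.5 * gain * net)


# The bipolar soft function is a shifted and scaled unipolar one.
difference = f_bip_soft(0.7, gain=2.0) - (2.0 * f_uni_soft(0.7, gain=2.0) - 1.0)
```

The value `difference` is zero up to rounding, which reflects the identity tanh(x/2) = 2/(1 + exp(-x)) - 1. A small change of `net` always changes the soft outputs, whereas the hard outputs change only when `net` crosses zero.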

A single neuron is capable of separating input patterns into two categories, and this separation is
linear. For example, for the patterns shown in Figure 5.6, the separation line crosses the x1 and x2
axes at points x10 and x20. This separation can be achieved with a neuron having the following
weights: w1 = 1/x10, w2 = 1/x20, and w3 = -1. In general, for n dimensions, the weights are

$$w_i = \frac{1}{x_{i0}} \quad \text{for } i = 1, \ldots, n, \qquad w_{n+1} = -1$$

One neuron can divide only linearly separated patterns. To select just one region in n-dimensional input 
space, more than n + 1 neurons should be used. 
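
The two-dimensional case can be checked with a short sketch. The axis intercepts 2.0 and 4.0 are hypothetical values chosen for illustration, and the function names are not from the original text.

```python
def separating_neuron(x10, x20):
    """Weights (w1, w2, w3) of a neuron whose line crosses the axes at x10 and x20."""
    return 1.0 / x10, 1.0 / x20, -1.0


def classify(point, weights):
    """Hard-threshold neuron; w3 multiplies the constant +1 input."""
    w1, w2, w3 = weights
    net = w1 * point[0] + w2 * point[1] + w3
    return 1 if net >= 0 else 0


weights = separating_neuron(x10=2.0, x20=4.0)  # hypothetical intercepts
below = classify((1.0, 1.0), weights)  # point on the origin side of the line
above = classify((3.0, 3.0), weights)  # point on the far side of the line
```

Points on opposite sides of the line x1/x10 + x2/x20 = 1 fall into different categories, which is all that a single neuron can do.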


FIGURE 5.6 Illustration of the property of linear separation of patterns in the two-dimensional space by a single
neuron.

FIGURE 5.7 Neural networks for parity-3 problem.


5.3 Should We Use Neurons with Bipolar 
or Unipolar Activation Functions? 


Neural network users often face the dilemma of whether to use unipolar or bipolar neurons (see
Figure 5.5). The short answer is that it does not matter. Both types of networks work the same way,
and it is very easy to transform a bipolar neural network into a unipolar neural network and vice
versa. Moreover, there is no need to change most of the weights; only the biasing weights have to be
changed. In order to change from bipolar networks to unipolar networks, only the biasing weights
must be modified using the formula

$$w_{bias}^{uni} = 0.5\left(w_{bias}^{bip} - \sum_{i=1}^{N} w_i^{bip}\right) \qquad (5.6)$$

while, in order to change from unipolar networks to bipolar networks,

$$w_{bias}^{bip} = 2\,w_{bias}^{uni} + \sum_{i=1}^{N} w_i^{uni} \qquad (5.7)$$


Figure 5.7 shows the neural network for parity-3 problem, which can be transformed both ways: 
from bipolar to unipolar and from unipolar to bipolar. Notice that only biasing weights are different. 
Obviously input signals in bipolar network should be in the range from —1 to +1, while for unipolar 
network they should be in the range from 0 to +1. 
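
The bias conversion can be verified numerically. The sketch below assumes hard-threshold neurons, unchanged input weights, and the relations w_bias_uni = 0.5 (w_bias_bip - sum of weights) and w_bias_bip = 2 w_bias_uni + sum of weights; the example neuron (weights +1, bias -1.5) is a hypothetical hidden neuron of a unipolar parity-3 network.

```python
from itertools import product


def bip_to_uni_bias(w_bias_bip, weights):
    """Biasing weight of the unipolar neuron equivalent to a bipolar one."""
    return 0.5 * (w_bias_bip - sum(weights))


def uni_to_bip_bias(w_bias_uni, weights):
    """Biasing weight of the bipolar neuron equivalent to a unipolar one."""
    return 2.0 * w_bias_uni + sum(weights)


def fires(inputs, weights, bias):
    """True when the net value of a hard-threshold neuron is non-negative."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias >= 0


weights = (1.0, 1.0, 1.0)
uni_bias = -1.5                                # assumed unipolar neuron
bip_bias = uni_to_bip_bias(uni_bias, weights)  # evaluates to 0.0

# Compare both neurons on all patterns; bipolar inputs are 2*u - 1.
agree = all(
    fires(u, weights, uni_bias) == fires([2 * x - 1 for x in u], weights, bip_bias)
    for u in product((0, 1), repeat=3)
)
```

Both neurons make the same decision on every pattern, and converting the bias back with `bip_to_uni_bias` recovers the original value.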


5.4 Feedforward Neural Networks 


Feedforward neural networks allow only unidirectional signal flow. Furthermore, most feedforward 
neural networks are organized in layers and this architecture is often known as MLP (multilayer percep- 
tron). An example of the three-layer feedforward neural network is shown in Figure 5.8. This network 
consists of four input nodes, two hidden layers, and an output layer. 

If the number of neurons in the input (hidden) layer is not limited, then all classification problems
can be solved using a multilayer network. An example of such a neural network, separating patterns
from the rectangular area in Figure 5.9, is shown in Figure 5.10.

When the hard threshold activation function is replaced by a soft activation function (with a gain
of 10), each neuron in the hidden layer performs a different task, as shown in Figure 5.11, and the
response of the output neuron is shown in Figure 5.12. One can notice that the shape of the output
surface depends on the gains of the activation functions. For example, if this gain is set to 30,
then the activation function looks almost like a hard activation function and the neural network
works as a classifier (Figure 5.13a). If the neural network gain is set to a smaller value, for
example, 5, then the neural network performs a nonlinear mapping, as shown in Figure 5.13b. Even
though this is a relatively simple example, it is essential for understanding neural networks.

FIGURE 5.8 An example of the three-layer feedforward neural network, which is sometimes known also
as MLP.

FIGURE 5.9 Rectangular area can be separated by four neurons.

FIGURE 5.10 Neural network architecture that can separate patterns in the rectangular area of Figure 5.9.

FIGURE 5.11 Responses of four hidden neurons of the network from Figure 5.10.

FIGURE 5.12 The net and output values of the output neuron of the network from Figure 5.10.
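
The effect of the gain can be explored with a small sketch of a four-hidden-neuron network. The rectangle 1 < x < 2, 1 < y < 3, the output bias of -3.5, and the use of the same gain in the output neuron are all assumptions made for illustration; they are not the values from the figures.

```python
import math


def sigmoid(net, gain):
    """Soft unipolar activation function."""
    return 1.0 / (1.0 + math.exp(-gain * net))


# One hidden neuron per side of the hypothetical rectangle: (wx, wy, wbias).
HIDDEN = [
    (1.0, 0.0, -1.0),   # x > 1
    (-1.0, 0.0, 2.0),   # x < 2
    (0.0, 1.0, -1.0),   # y > 1
    (0.0, -1.0, 3.0),   # y < 3
]
OUTPUT_BIAS = -3.5  # output neuron responds only when all four hidden neurons do


def network(x, y, gain):
    """Output of the two-layer network for the point (x, y)."""
    hidden = [sigmoid(wx * x + wy * y + wb, gain) for wx, wy, wb in HIDDEN]
    return sigmoid(sum(hidden) + OUTPUT_BIAS, gain)


inside_high_gain = network(1.5, 2.0, gain=30.0)
outside_high_gain = network(0.0, 2.0, gain=30.0)
inside_low_gain = network(1.5, 2.0, gain=5.0)
```

With the large gain the outputs are close to 1 inside and close to 0 outside the rectangle (a classifier); with the small gain the output varies smoothly, so the same weights produce a nonlinear mapping.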


Let us now use the same neural network architecture as shown in Figure 5.10, but let us change the
weights of the hidden neurons so that their neuron lines are located as shown in Figure 5.14. This
network can separate patterns in a pentagonal shape, as shown in Figure 5.15a, or perform a complex
nonlinear mapping, as shown in Figure 5.15b, depending on the neuron gains. The simple example of
the network from Figure 5.10 is very educational because it lets the neural network user understand
how a neural network operates, and it may help in selecting a proper neural network architecture for
problems of different complexities. Commonly used trial-and-error methods may not be successful
unless the user has some understanding of neural network operation.

FIGURE 5.13 Response of the neural network of Figure 5.10 with different values of neuron gain: (a) gain = 30
and the network works as a classifier and (b) gain = 5 and the network performs nonlinear mapping.


Neuron equations:

    3x + y - 3 > 0      w11 = 3, w12 = 1,  w13 = -3
    x + 3y - 3 > 0      w21 = 1, w22 = 3,  w23 = -3
    x - 2 > 0           w31 = 1, w32 = 0,  w33 = -2
    x - 2y + 2 > 0      w41 = 1, w42 = -2, w43 = 2


FIGURE 5.15 Response of the neural network of Figure 5.10 with weights defined in Figure 5.14 for different
values of neuron gain: (a) gain = 200 and the network works as a classifier and (b) gain = 2 and the network
performs nonlinear mapping.


The linear separation property of neurons makes some problems especially difficult for neural
networks, such as exclusive OR, parity computation for several bits, or the separation of patterns
lying on two neighboring spirals. Also, the most commonly used feedforward neural networks may have
difficulty separating clusters in multidimensional space. For example, in order to separate a
cluster in two-dimensional space, we have used four neurons (rectangle), but it is also possible to
separate a cluster with three neurons (triangle). In three dimensions, we may need at least four
planes (neurons) to separate a space with a tetrahedron. In n-dimensional space, at least n + 1
neurons are required to separate a cluster of patterns. However, if a neural network with several
hidden layers is used, then the number of neurons needed may not be that excessive. Also, a neuron
in the first hidden layer may be used for the separation of multiple clusters. Let us analyze
another example, where we would like to design a neural network with multiple outputs to separate
three clusters, and each network output must produce +1 only for a given cluster. Figure 5.16 shows
the three clusters to be separated, the corresponding equations for four neurons, and the weights
of the resulting neural network, which is shown in Figure 5.17.

Neuron equations:

    -4x + y + 2 > 0
    -x + 2y - 2 > 0
    x + y + 2 > 0
    3x + y > 0

Weights for the first layer:

    wx    wy    wbias
    -4     1     2      layer1_neuron1
    -1     2    -2      layer1_neuron2
     1     1     2      layer1_neuron3
     3     1     0      layer1_neuron4

Weights for the second layer:

    w1    w2    w3    w4    wbias
     0    -1    +1    -1    -0.5    layer2_neuron1
    +1    -1     0    +1    -1.5    layer2_neuron2
    +1    -1     0     0    -0.5    layer2_neuron3


FIGURE 5.16 Problem with the separation of three clusters.


FIGURE 5.17 Neural network performing cluster separation and resulting output surfaces for all three clusters.

The example with three clusters shows that often there is no need to have several neurons in the
hidden layer dedicated to a specific cluster. These hidden neurons may perform multiple functions,
and they can contribute to several clusters instead of just one. It is, of course, possible to
develop separate neural networks for every cluster, but it is much more efficient to have one neural
network with multiple outputs, as shown in Figures 5.16 and 5.17. This is one advantage of neural
networks over fuzzy systems, which can be developed only for one output at a time [WJK99,MW01].
Another advantage of neural networks is that the number of inputs can be very large, so they can
process signals in multidimensional space, while fuzzy systems can usually handle only two or three
inputs [WB99].

The most commonly used neural networks have the MLP architecture, as shown in Figure 5.8. For
such a layer-by-layer network, it is relatively easy to develop the learning software, but these
networks are significantly less powerful than networks where connections across layers are allowed.
Unfortunately, only a very limited number of software packages have been developed to train networks
other than MLP [WJ96,W02]. As a result, most researchers use MLP architectures, which are far from
optimal. Much better results can be obtained with the BMLP (bridged MLP) architecture or with the
FCC (fully connected cascade) architecture [WHM03]. Also, most researchers use the simple EBP (error
backpropagation) learning algorithm, which is not only much slower than more advanced algorithms
such as LM (Levenberg-Marquardt) [HM94] or NBN (neuron by neuron) [WCKD08,HW09,WH10], but is also
often unable to train close-to-optimal neural networks [W09].

6.1 Introduction


Different neural network architectures are widely described in the literature [W89,Z95,W96,WJK99,
H99,WB01,W07]. Feedforward neural networks allow signal flow in only one direction. Furthermore,
most feedforward neural networks are organized in layers. An example of a three-layer feedforward
neural network is shown in Figure 6.1. This network consists of three input nodes, two hidden
layers, and an output layer. Typical activation functions are shown in Figure 6.2. These continuous
activation functions allow for the gradient-based training of multilayer networks. It is usually
difficult to predict the required size of a neural network, and often the size is found by trial and
error. Another approach is to start with a neural network much larger than required and to reduce
its size by applying one of the pruning algorithms [FF02,FFN01,FFJC09].


6.2 Special Easy-to-Train Neural Network Architectures 


Training of multilayer neural networks is difficult. It is much easier to train a single neuron or a single 
layer of neurons. Therefore, several concepts of neural network architectures were developed where only 
one neuron can be trained at a time. There are also neural network architectures where training is not 
needed [HN87,W02]. This chapter reviews various easy-to-train architectures. Also, it will be shown 
that abilities to recognize patterns strongly depend on the used architectures. 


FIGURE 6.2 Typical activation functions: (a) bipolar and (b) unipolar.


6.2.1 Polynomial Networks 


Using nonlinear terms with initially determined functions, the actual number of inputs supplied to the 
one layer neural network is increased. In the simplest case, nonlinear elements are higher order polynomial 
terms of input patterns. 

The learning procedure for one layer is easy and fast. Figure 6.3 shows an XOR problem solved using
polynomial networks. Figure 6.4 shows a single trainable layer neural network with nonlinear
polynomial terms.

FIGURE 6.3 Polynomial networks for solution of the XOR problem: (a) using unipolar signals and (b) using
bipolar signals.

FIGURE 6.4 One layer neural network with nonlinear polynomial terms.


Note that polynomial networks have their limitations: they cannot handle problems with many inputs
because the number of polynomial terms may grow exponentially.
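
A minimal sketch of a polynomial solution of the XOR problem with unipolar signals follows. The weights (1, 1, -2) and the threshold 0.5 are assumed values that happen to work; they are not taken from Figure 6.3. The helper `polynomial_term_count` is a hypothetical name used to illustrate how quickly the number of terms grows.

```python
from math import comb


def xor_polynomial(x1, x2):
    """Single neuron fed with x1, x2 and the product term x1*x2."""
    net = 1.0 * x1 + 1.0 * x2 - 2.0 * (x1 * x2)  # assumed weights 1, 1, -2
    return 1 if net >= 0.5 else 0                 # assumed threshold 0.5


def polynomial_term_count(n_inputs, degree):
    """Number of non-constant monomials of total degree <= degree."""
    return comb(n_inputs + degree, degree) - 1


xor_table = {(a, b): xor_polynomial(a, b) for a in (0, 1) for b in (0, 1)}
terms_small = polynomial_term_count(2, 2)    # x1, x2, x1^2, x1*x2, x2^2
terms_large = polynomial_term_count(20, 5)   # many inputs, fifth-order terms
```

The product term makes the problem linearly separable for a single neuron, while `terms_large` shows why the approach becomes impractical for networks with many inputs.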


6.2.2 Functional Link Networks 


One-layer neural networks are relatively easy to train, but these networks can solve only linearly 
separated problems. One possible solution for nonlinear problems was elaborated by Pao [P89] 
using the functional link network shown in Figure 6.5. Note that the functional link network can 
be treated as a one-layer network, where additional input data are generated off-line using nonlinear 
transformations. 

Note that, when the functional link approach is used, this difficult problem becomes a trivial one. The 
problem with the functional link network is that proper selection of nonlinear elements is not an easy 
task. However, in many practical cases it is not difficult to predict what kind of transformation of input 
data may linearize the problem, so the functional link approach can be used. 


6.2.3 Sarajedini and Hecht-Nielsen Network 


Figure 6.6 shows a neural network that can calculate the Euclidean distance between two vectors x
and w. In this powerful network, one may set the weights to the desired point w in a
multidimensional space, and the network will calculate the Euclidean distance for any new pattern on
the input. The difficult task is to calculate ||x||^2, but this can be done off-line for all
incoming patterns. A sample output for a two-dimensional case is shown in Figure 6.7.
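
The distance computation rests on the identity ||x - w||^2 = ||x||^2 - 2 w.x + ||w||^2, which a single linear neuron can evaluate if ||x||^2 is supplied as an extra input. The sketch below assumes input weights of -2w, a weight of 1 on the ||x||^2 input, and a bias of ||w||^2; these assignments are an illustrative reading of the idea, not a transcription of Figure 6.6.

```python
def squared_distance_neuron(x, w):
    """Linear neuron whose net value equals the squared Euclidean distance."""
    x_norm_sq = sum(xi * xi for xi in x)  # extra input, can be computed off-line
    w_norm_sq = sum(wi * wi for wi in w)  # plays the role of the biasing weight
    weighted_inputs = sum(-2.0 * wi * xi for wi, xi in zip(w, x))
    return weighted_inputs + 1.0 * x_norm_sq + w_norm_sq


stored_point = [4.0, 6.0]   # hypothetical point w stored in the weights
pattern = [1.0, 2.0]        # hypothetical input pattern x
net = squared_distance_neuron(pattern, stored_point)
```

For these values the direct calculation gives (1 - 4)^2 + (2 - 6)^2 = 25, which is the value the neuron returns.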
FIGURE 6.6 Sarajedini and Hecht-Nielsen neural network.

FIGURE 6.7 Output of the Sarajedini and Hecht-Nielsen network is proportional to the square of Euclidean
distance.


6.2.4 Feedforward Version of the Counterpropagation Network


The counterpropagation network was originally proposed by Hecht-Nielsen [HN87]. In this chapter, a
modified feedforward version as described by Zurada [Z92] is discussed. This network, which is shown
in Figure 6.8, requires the number of hidden neurons to be equal to the number of input patterns, or,
more exactly, to the number of input clusters.

When binary (bipolar) input patterns are considered, the input weights must be exactly equal to the
input patterns. In this case,

$$net = \mathbf{x}^T \mathbf{w} = n - 2\,HD(\mathbf{x}, \mathbf{w}) \qquad (6.1)$$

where
n is the number of inputs
w are the weights
x is the input vector
HD(x,w) is the Hamming distance between the input pattern and the weights

In order for a neuron in the input layer to react only to the stored pattern, the threshold value
for this neuron should be

$$w_{n+1} = -(n - 1) \qquad (6.2)$$

If it is required that the neuron must also react to similar patterns, then the threshold should be
set to w_{n+1} = -(n - (1 + HD)), where HD is the Hamming distance defining the range of similarity.
Since for a given input pattern only one neuron in the first layer may have the value of one and the
remaining neurons have zero values, the weights in the output layer are equal to the required output
pattern.

The network, with unipolar activation functions in the first layer, works as a look-up table. When the 
linear activation function (or no activation function at all) is used in the second layer, then the network 
also can be considered as an analog memory (Figure 6.9) [W03,WJ96]. 

The counterpropagation network is very easy to design. The number of neurons in the hidden layer
should be equal to the number of patterns (clusters). The weights in the input layer should be equal
to the input patterns, and the weights in the output layer should be equal to the output patterns.
This simple network can be used for rapid prototyping. The counterpropagation network usually has
more hidden neurons than required.
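
The design recipe can be sketched as a look-up table. The example assumes bipolar input patterns, hidden weights equal to the stored patterns, a biasing weight of -(n - 1), unipolar hard-threshold hidden neurons, and a linear output layer; the patterns and output values are hypothetical.

```python
def build_counterprop(input_patterns, output_patterns):
    """Design the network directly from the pattern pairs (no training)."""
    n = len(input_patterns[0])
    hidden = [(list(p), -(n - 1)) for p in input_patterns]  # (weights, bias)
    output_weights = [list(o) for o in output_patterns]
    return hidden, output_weights


def recall(x, hidden, output_weights):
    """Unipolar hard-threshold hidden layer followed by a linear output layer."""
    h = [
        1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0
        for weights, bias in hidden
    ]
    n_out = len(output_weights[0])
    return [sum(h[k] * output_weights[k][j] for k in range(len(h))) for j in range(n_out)]


inputs = [[1, -1, 1, -1], [1, 1, -1, -1]]   # hypothetical stored patterns
outputs = [[0.2, 0.7], [0.9, 0.1]]          # hypothetical analog outputs
hidden, output_weights = build_counterprop(inputs, outputs)

exact = recall([1, -1, 1, -1], hidden, output_weights)
one_bit_off = recall([1, -1, 1, 1], hidden, output_weights)
```

An exact match activates one hidden neuron, so the stored output row is reproduced. With the bias -(n - 1), a pattern at Hamming distance 1 activates no hidden neuron and the output is zero.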


FIGURE 6.8 Counterpropagation network.

FIGURE 6.9 Counterpropagation network used as analog memory with analog address.

FIGURE 6.10 Learning vector quantization.


6.2.5 Learning Vector Quantization 


In the learning vector quantization (LVQ) network (Figure 6.10), the first layer detects subclasses.
The second layer combines subclasses into a single class. The first layer computes the Euclidean
distances between the input pattern and the stored patterns. The winning "neuron" is the one with
the minimum distance.
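
The two-layer operation can be sketched as a nearest-prototype search followed by a table that maps subclasses to classes. The prototypes and class labels below are hypothetical, and the sketch covers only recall, not the LVQ training procedure.

```python
import math


def lvq_classify(x, prototypes, subclass_to_class):
    """First layer: nearest stored pattern; second layer: subclass -> class."""
    distances = [math.dist(x, p) for p in prototypes]
    winner = distances.index(min(distances))  # neuron with the minimum distance
    return subclass_to_class[winner]


prototypes = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]  # hypothetical subclass centers
subclass_to_class = ["A", "A", "B"]                # two subclasses form class A

label_near_second = lvq_classify((0.9, 1.2), prototypes, subclass_to_class)
label_near_third = lvq_classify((4.0, 4.5), prototypes, subclass_to_class)
```

Two different winning subclasses can map to the same class, which lets the network represent classes that are not convex.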


6.2.6 WTA Architecture 


The winner-take-all (WTA) network was proposed by Kohonen [K88]. This is basically a one-layer
network used in the unsupervised training algorithm to extract a statistical property of the input
data. In the first step, all input data are normalized so that the length of each input vector is
the same, usually equal to unity. The activation functions of the neurons are unipolar and
continuous. The learning process starts with a weight initialization to small random values.

FIGURE 6.11 Neuron as the Hamming distance classifier.


Let us consider the neuron shown in Figure 6.11. If the inputs are binary, for example
X = [1, -1, 1, -1, -1], then the maximum value of net

$$net = \sum_{i=1}^{5} x_i w_i = X W^T \qquad (6.3)$$

is obtained when the weights are identical to the input pattern, W = [1, -1, 1, -1, -1]. The
Euclidean distance between weight vector W and input vector X is

$$\|W - X\| = \sqrt{(w_1 - x_1)^2 + (w_2 - x_2)^2 + \cdots + (w_n - x_n)^2} \qquad (6.4)$$

$$\|W - X\| = \sqrt{(W - X)(W - X)^T} \qquad (6.5)$$

$$\|W - X\| = \sqrt{W W^T - 2 W X^T + X X^T} \qquad (6.6)$$

When the lengths of both the weight and input vectors are normalized to the value of 1

$$\|X\| = 1 \quad \text{and} \quad \|W\| = 1 \qquad (6.7)$$

then the equation simplifies to

$$\|W - X\| = \sqrt{2 - 2 W X^T} \qquad (6.8)$$

Please notice that the maximum value of net, net = 1, occurs when W and X are identical.
Kohonen WTA networks have some problems: 


1. Important information about length of the vector is lost during the normalization process 
2. Clustering depends on 

a. Order of patterns applied 

b. Number of initial neurons 

c. Initial weights 
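
The relation between the net value and the Euclidean distance for normalized vectors can be checked with a short sketch. The weight vectors are hypothetical, and only the winner selection is shown, not the unsupervised weight update.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def net_value(w, x):
    """Weighted sum of inputs (dot product)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def distance_from_net(net):
    """Distance between unit vectors computed from net; clamped against rounding."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * net))


x = normalize([1, -1, 1, -1, -1])
weight_vectors = [normalize([1, -1, 1, -1, -1]), normalize([1, 1, 1, 1, 1])]

nets = [net_value(w, x) for w in weight_vectors]
winner = nets.index(max(nets))  # largest net means smallest distance
```

Because the distance is a decreasing function of net for unit vectors, choosing the neuron with the largest net is equivalent to choosing the one with the minimum distance.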


6.2.7 Cascade Correlation Architecture 


The cascade correlation architecture (Figure 6.12) was proposed by Fahlman and Lebiere [FL90]. The pro- 
cess of network building starts with a one-layer neural network and hidden neurons are added as needed. 

In each training step, a new hidden neuron is added and its weights are adjusted to maximize the
magnitude of the correlation between the new hidden neuron output and the residual error signal on
the network output that we are trying to eliminate. The correlation parameter S defined in the
following equation must be maximized:

$$S = \sum_{o=1}^{O} \left| \sum_{p=1}^{P} (V_p - \bar{V})(E_{po} - \bar{E}_o) \right| \qquad (6.9)$$

where
O is the number of network outputs
P is the number of training patterns
V_p is the output of the new hidden neuron
E_po is the error on the network output

By finding the gradient, \partial S / \partial w_i, the weight adjustment for the new neuron can be
found as

$$\Delta w_i = \sum_{o=1}^{O} \sum_{p=1}^{P} \sigma_o (E_{po} - \bar{E}_o) f'_p x_{ip} \qquad (6.10)$$

where \sigma_o is the sign of the correlation between the new neuron output and network output o,
f'_p is the derivative of the activation function for pattern p, and x_ip is the ith input signal
for pattern p.

FIGURE 6.12 Cascade correlation architecture.


The output neurons are trained using the delta (backpropagation) algorithm. Each hidden neuron is 
trained just once and then its weights are frozen. The network learning and building process is com- 
pleted when satisfactory results are obtained. 
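
The correlation parameter can be computed directly from the candidate neuron outputs and the residual errors. The following is a minimal sketch with hypothetical data; it evaluates S only and does not perform the weight adjustment or the network building.

```python
def correlation_score(v, errors):
    """S: sum over outputs of |sum over patterns of (V_p - mean V)(E_po - mean E_o)|.

    v[p] is the candidate neuron output for pattern p;
    errors[p][o] is the residual error at output o for pattern p.
    """
    n_patterns = len(v)
    n_outputs = len(errors[0])
    v_mean = sum(v) / n_patterns
    score = 0.0
    for o in range(n_outputs):
        e_mean = sum(errors[p][o] for p in range(n_patterns)) / n_patterns
        covariance = sum(
            (v[p] - v_mean) * (errors[p][o] - e_mean) for p in range(n_patterns)
        )
        score += abs(covariance)
    return score


errors = [[-0.5], [0.5], [-0.5], [0.5]]                   # hypothetical residual errors
s_correlated = correlation_score([0.1, 0.9, 0.2, 0.8], errors)
s_anticorrelated = correlation_score([0.9, 0.1, 0.8, 0.2], errors)
s_constant = correlation_score([0.5, 0.5, 0.5, 0.5], errors)
```

Because the magnitude is used, a candidate that is strongly anticorrelated with the error scores as well as a correlated one; the sign is absorbed later by the output weight.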


6.2.8 Radial Basis Function Networks 


The structure of the radial basis function (RBF) network is shown in Figure 6.13. This type of
network usually has only one hidden layer with special "neurons." Each of these "neurons" responds
only to the input signals close to the stored pattern.

The output signal h, of the ith hidden “neuron” is computed using the formula: 


(6.11) 


Note that the behavior of this "neuron" significantly differs from the biological neuron. In this
"neuron," excitation is not a function of the weighted sum of the input signals. Instead, the
distance between the input and stored pattern is computed. If this distance is zero, then the
"neuron" responds with a maximum output magnitude equal to one. This "neuron" is capable of
recognizing certain patterns and generating output signals that are functions of a similarity.

FIGURE 6.13 Radial basis function networks.
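
A distance-based "neuron" of this kind can be sketched as follows. The Gaussian form and the width parameter `sigma` are assumptions (the Gaussian is the most common choice for RBF units); the stored pattern is hypothetical.

```python
import math


def rbf_neuron(x, stored, sigma=1.0):
    """Gaussian response to the distance between input x and the stored pattern."""
    dist_sq = sum((xi - si) ** 2 for xi, si in zip(x, stored))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


stored_pattern = [1.0, 2.0]                        # hypothetical stored pattern
at_center = rbf_neuron([1.0, 2.0], stored_pattern)
nearby = rbf_neuron([1.5, 2.0], stored_pattern)
far_away = rbf_neuron([3.0, 2.0], stored_pattern)
```

The output equals one at zero distance and decays as the input moves away from the stored pattern, so the output can be read as a measure of similarity.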


6.2.9 Implementation of RBF Networks with Sigmoidal Neurons 


The network shown in Figure 6.14 has properties (and power) similar to those of RBF networks, but it
uses only traditional neurons with sigmoidal activation functions [WJ96]. By augmenting the input
space with another dimension, the traditional neural network will perform as an RBF network. Please
notice that this additional transformation can be made by another neural network. As shown in
Figure 6.15, the first two neurons create an additional dimension, and then a simple feedforward
network with eight neurons in one layer can solve the two-spiral problem. Without this
transformation, about 35 neurons are required to solve the same problem using a neural network with
one hidden layer.


6.2.10 Networks for Solution of Parity-N Problems 


The most common test benches for neural networks are parity-N problems, which are considered to be
the most difficult benchmark for neural network training. The simplest parity-2 problem is also
known as the XOR problem. The larger the N, the more difficult it is to solve. Even though parity-N
problems are very complicated, it is possible to analytically design neural networks to solve them
[WHM03,W09]. Let us design neural networks for the parity-7 problem using different neural network
architectures with unipolar neurons.

FIGURE 6.14 Transformation, which is required to give a traditional neuron the RBF properties.

FIGURE 6.15 Solution of two spiral problem using transformation from Figure 6.14 implemented on two
additional neurons.

Figure 6.16 shows the multilayer perceptron (MLP) architecture with one hidden layer. In order to
properly classify patterns in parity-N problems, the location of zeros and ones in the input
patterns is not relevant, but it is important how many ones are in the patterns. Therefore, one may
assume identical weights equal to +1 connected to all inputs. Depending on the number of ones in the
pattern, the net values of neurons in the hidden layer are calculated as a sum of inputs times
weights. The results may vary from 0 to 7 and will be equal to the number of ones in an input
pattern. In order to separate these eight possible cases, we need seven neurons in the hidden layer
with thresholds equal to 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, and 6.5. Let us assign positive (+1) and
negative (-1) weights to the outputs of consecutive neurons, starting with +1. One may notice that
the net value of the output neuron will be one for patterns with an odd number of ones and zero for
patterns with an even number of ones. The threshold of +0.5 of the last neuron will just reinforce
the same values on the output. The signal flow for this network is shown in the table of Figure 6.16.

FIGURE 6.16 MLP architecture for parity-7 problem (all weights from the inputs = 1). The computation
process of the network is shown in the table.

In summary, for the case of an MLP neural network, the number of neurons in the hidden layer is equal
to N = 7 and the total number of neurons is 8. For other parity-N problems and the MLP architecture:


Number of neurons = N +1 (6.12) 
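
The MLP design described above can be checked exhaustively with a short sketch. The function names are hypothetical; the weights and thresholds follow the description in the text (input weights +1, hidden thresholds 0.5 to N - 0.5, alternating output weights, output threshold 0.5).

```python
from itertools import product


def step(net, threshold):
    """Unipolar hard-threshold neuron: 1 if net reaches the threshold, else 0."""
    return 1 if net >= threshold else 0


def parity_mlp(bits):
    """MLP parity network: N hidden neurons plus one output neuron."""
    n = len(bits)
    net = sum(bits)  # every input weight is +1, so net counts the ones
    hidden = [step(net, k + 0.5) for k in range(n)]  # thresholds 0.5 ... N-0.5
    # Output weights alternate +1, -1, +1, ... starting with +1.
    out_net = sum(h if k % 2 == 0 else -h for k, h in enumerate(hidden))
    return step(out_net, 0.5)


def matches_parity(network, n):
    """Compare the network with sum(bits) % 2 on all 2**n input patterns."""
    return all(network(p) == sum(p) % 2 for p in product((0, 1), repeat=n))


mlp_ok = matches_parity(parity_mlp, 7)
```

With k ones in the pattern, the first k hidden neurons are active, so the alternating sum is 1 for odd k and 0 for even k.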


Figure 6.17 shows a solution with the bridged multilayer perceptron (BMLP), which has connections
across layers. With this approach, the neural network can be significantly simplified. Only three
neurons are needed in the hidden layer, with thresholds equal to 1.5, 3.5, and 5.5. In this case,
all weights associated with the outputs of hidden neurons must be equal to -2, while all remaining
weights in the network are equal to +1. Signal flow in this BMLP network is shown in the table in
Figure 6.17. With bridged connections across layers, the number of hidden neurons was reduced to
(N - 1)/2 = 3 and the total number of neurons is 4.

FIGURE 6.17 BMLP architecture with one hidden layer for parity-7 problem. The computation process of
the network is shown in the table.

[Table rows in Figure 6.17: number of ones in a pattern; net (from inputs only); out1 (T = 1.5);
out2 (T = 3.5); out3 (T = 5.5); net1 = net - 2*(out1 + out2 + out3); out4 (of output neuron,
T = 0.5). All weights from inputs = 1; weights from hidden neurons = -2.]

For other parity-N problems and the BMLP architecture:

$$\text{Number of neurons} = \begin{cases} \dfrac{N-1}{2} + 1 & \text{for odd parity (odd } N\text{)} \\ \dfrac{N}{2} + 1 & \text{for even parity (even } N\text{)} \end{cases} \qquad (6.13)$$
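
The bridged design can be verified in the same way. In this sketch the output neuron receives the inputs directly (weights +1) and the hidden outputs through weights of -2; extending the thresholds 1.5, 3.5, 5.5 to other values of N is an assumption that follows the same pattern.

```python
from itertools import product


def step(net, threshold):
    """Unipolar hard-threshold neuron: 1 if net reaches the threshold, else 0."""
    return 1 if net >= threshold else 0


def parity_bmlp(bits):
    """BMLP parity network with connections from the inputs to the output neuron."""
    n = len(bits)
    net = sum(bits)                                   # all input weights are +1
    thresholds = [k + 0.5 for k in range(1, n, 2)]    # 1.5, 3.5, 5.5, ...
    hidden = [step(net, t) for t in thresholds]
    out_net = net - 2 * sum(hidden)                   # bridged inputs, -2 hidden weights
    return step(out_net, 0.5)


def matches_parity(network, n):
    """Compare the network with sum(bits) % 2 on all 2**n input patterns."""
    return all(network(p) == sum(p) % 2 for p in product((0, 1), repeat=n))


bmlp_ok_7 = matches_parity(parity_bmlp, 7)
bmlp_ok_8 = matches_parity(parity_bmlp, 8)
```

Each active hidden neuron subtracts 2 from the count of ones, so the output neuron always sees 0 or 1, which is the parity.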


Figure 6.18 shows a solution for the fully connected cascade (FCC) architecture for the same
parity-7 problem. In this case, only three neurons are needed, with thresholds 3.5, 1.5, and 0.5.
The first neuron, with threshold 3.5, is inactive (out = 0) if the number of ones in an input
pattern is less than 4. If the number of ones in an input pattern is 4 or more, then the first
neuron becomes active, and with the -4 weights attached to its output it subtracts 4 from the nets
of neurons 2 and 3. Instead of [0 1 2 3 4 5 6 7], these neurons will see [0 1 2 3 0 1 2 3]. The
second neuron, with a threshold of 1.5 and the -2 weight associated with its output, works in such
a way that the last neuron will see [0 1 0 1 0 1 0 1] instead of [0 1 2 3 0 1 2 3].

FIGURE 6.18 FCC architecture for parity-7 problem. The computation process of the network is shown in
the table.

[Table rows in Figure 6.18: number of ones in a pattern; net1 (from inputs only); out1 (of first
neuron, T = 3.5); net2 = net1 - 4*out1; out2 (of second neuron, T = 1.5); net3 = net2 - 2*out2;
out3 (of output neuron, T = 0.5).]

For other parity-N problems and the FCC architecture:

$$\text{Number of neurons} = \lceil \log_2(N + 1) \rceil \qquad (6.14)$$
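
The cascade can be sketched as a loop in which each neuron removes one binary digit from the count of ones. For parity-7 this reproduces the thresholds 3.5, 1.5, 0.5 and the weights -4 and -2 from the text; using powers of two for other values of N is an assumption made here to generalize the example.

```python
from itertools import product


def step(net, threshold):
    """Unipolar hard-threshold neuron: 1 if net reaches the threshold, else 0."""
    return 1 if net >= threshold else 0


def parity_fcc(bits):
    """FCC parity network; every neuron feeds all neurons that follow it."""
    n = len(bits)
    m = n.bit_length()      # equals ceil(log2(n + 1)) for n >= 1
    net = sum(bits)         # all input weights are +1
    out = 0
    for j in range(m):
        weight = 2 ** (m - 1 - j)        # 4, 2, 1 for parity-7
        out = step(net, weight - 0.5)    # thresholds 3.5, 1.5, 0.5
        if j < m - 1:
            net -= weight * out          # -4 and -2 connections to later neurons
    return out


def matches_parity(network, n):
    """Compare the network with sum(bits) % 2 on all 2**n input patterns."""
    return all(network(p) == sum(p) % 2 for p in product((0, 1), repeat=n))


fcc_ok_7 = matches_parity(parity_fcc, 7)
fcc_ok_8 = matches_parity(parity_fcc, 8)
```

The running value `net` is what each later neuron sees after the contributions of all earlier neurons, so the last neuron sees only the least significant bit of the count.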


6.2.11 Pulse-Coded Neural Networks 


Commonly used artificial neurons behave very differently from biological neurons. In biological
neurons, information is sent in the form of pulse trains [WJPM96]. As a result, additional phenomena
such as pulse synchronization play an important role, and pulse-coded neural networks are much more
powerful than traditional artificial neurons. They can be used very efficiently, for example, for
image filtration [WPJ96]. However, their hardware implementation is much more difficult
[OW99,WJK99].


6.3 Comparison of Neural Network Topologies 


With the design process described in Section 6.2, it is possible to design neural networks for
arbitrarily large parity problems using MLP, BMLP, and FCC architectures. Table 6.1 compares the
minimum number of neurons required for these three architectures and various parity-N problems.

As one can see from Table 6.1 and Figures 6.16 through 6.18, the MLP architectures are the least
efficient for parity-N problems. For small parity problems, BMLP and FCC architectures give similar
results. For larger parity problems, the FCC architecture has a significant advantage, and this is
mostly due to the larger number of layers used. With more layers, one can also expect better results
in BMLP. These more powerful neural network architectures require more advanced software to train
them [WCKD07,WCKD08,WH10]. Most of the neural network software available on the market can train
only MLP networks [DB04,HW09].
TABLE 6.1 Minimum Number of Neurons Required for Various Parity-N Problems

                             Parity-8   Parity-16   Parity-32   Parity-64
# inputs                     8          16          32          64
# patterns                   256        65536       4.294e+9    1.845e+19
MLP (one hidden layer)       9          17          33          65
BMLP (one hidden layer)      5          9           17          33
FCC                          4          5           6           7
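
The neuron counts in Table 6.1 follow from the three formulas for the number of neurons. The sketch below recomputes them with integer arithmetic; the function names are hypothetical.

```python
def neurons_mlp(n):
    """MLP with one hidden layer: N hidden neurons plus one output neuron."""
    return n + 1


def neurons_bmlp(n):
    """BMLP with one hidden layer: (N-1)/2 + 1 for odd N, N/2 + 1 for even N."""
    return n // 2 + 1


def neurons_fcc(n):
    """FCC: ceil(log2(N + 1)), computed exactly as the bit length of N."""
    return n.bit_length()


table = {
    n: (2 ** n, neurons_mlp(n), neurons_bmlp(n), neurons_fcc(n))
    for n in (8, 16, 32, 64)
}
```

Each entry of `table` holds the number of patterns and the neuron counts for MLP, BMLP, and FCC, so the linear growth of the first two architectures can be compared with the logarithmic growth of FCC.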


6.4 Recurrent Neural Networks 


In contrast to feedforward neural networks, in recurrent networks neuron outputs can be connected
to their inputs. Thus, signals in the network can circulate continuously. Until now, only a limited
number of recurrent neural networks have been described.


6.4.1 Hopfield Network 


The single-layer recurrent network was analyzed by Hopfield [H82]. This network, shown in
Figure 6.19, has unipolar hard threshold neurons with outputs equal to 0 or 1. Weights are given by
a symmetrical square matrix W with zero elements (w_ij = 0 for i = j) on the main diagonal. The
stability of the system is usually analyzed by means of the energy function

$$E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} v_i v_j \qquad (6.15)$$

It was proved that during signal circulation the energy E of the network decreases and the system
converges to the stable points. This is especially true when the values of system outputs are
updated in the asynchronous mode. This means that in a given cycle, only one random output can be
changed to the required value. Hopfield also proved that the stable points to which the system
converges can be programmed by adjusting the weights using a modified Hebbian [H49] rule

$$\Delta w_{ij} = \Delta w_{ji} = (2v_i - 1)(2v_j - 1) \qquad (6.16)$$

Such memory has limited storage capacity. Based on experiments, Hopfield estimated that the maximum
number of stored patterns is 0.15N, where N is the number of neurons.
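
The storage rule and the asynchronous recall can be sketched for a small unipolar network. Several choices here are assumptions made for reproducibility: outputs are updated one at a time in a fixed order instead of a random one, an output keeps its value when its net is exactly zero, and the energy includes the factor 1/2. The stored pattern is hypothetical.

```python
def train_hopfield(patterns):
    """Modified Hebbian rule for unipolar patterns; the diagonal stays zero."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for v in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += (2 * v[i] - 1) * (2 * v[j] - 1)
    return w


def energy(w, v):
    """Energy E = -1/2 * sum_i sum_j w_ij v_i v_j."""
    n = len(v)
    return -0.5 * sum(w[i][j] * v[i] * v[j] for i in range(n) for j in range(n))


def recall(w, v, sweeps=10):
    """Asynchronous recall: one output is updated at a time (fixed order)."""
    v = list(v)
    for _ in range(sweeps):
        changed = False
        for i in range(len(v)):
            net = sum(w[i][j] * v[j] for j in range(len(v)))
            new = 1 if net > 0 else 0 if net < 0 else v[i]
            if new != v[i]:
                v[i] = new
                changed = True
        if not changed:
            break
    return v


stored = [1, 0, 1, 0, 1, 0]          # hypothetical stored pattern
w = train_hopfield([stored])
start = [1, 0, 1, 0, 0, 0]           # stored pattern with one bit changed
result = recall(w, start)
```

Starting from the corrupted pattern, the network settles in the stored pattern, and the energy of the final state is lower than that of the starting state.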


6.4.2 Autoassociative Memory 


Hopfield [H82] extended the concept of his network to autoassociative memories. In the same network
structure, as shown in Figure 6.19, bipolar neurons are used with outputs equal to -1 or +1. In this
network, patterns s_m are stored in the weight matrix W using the autocorrelation algorithm

$$W = \sum_{m=1}^{M} \mathbf{s}_m \mathbf{s}_m^T - M I \qquad (6.17)$$

where
M is the number of stored patterns
I is the identity matrix

FIGURE 6.19 Autoassociative memory.


Note that W is a square symmetrical matrix with elements on the main diagonal equal to zero
(w_ij = 0 for i = j). Using a modified formula, new patterns can be added to or subtracted from the
memory. When such memory is exposed to a binary bipolar pattern by enforcing the initial network
states, then after signal circulation the network will converge to the closest (most similar) stored
pattern or to its complement. This stable point will be at the closest minimum of the energy function

$$E(\mathbf{v}) = -\frac{1}{2} \mathbf{v}^T W \mathbf{v} \qquad (6.18)$$

Like the Hopfield network, the autoassociative memory has limited storage capacity, which is
estimated to be about M_max = 0.15N. When the number of stored patterns is large and close to the
memory capacity, the network has a tendency to converge to spurious states, which were not stored.
These spurious states are additional minima of the energy function.

FIGURE 6.20 Bidirectional autoassociative memory.


6.4.3 BAM—Bidirectional Autoassociative Memories 


The concept of the autoassociative memory was extended to bidirectional associative memories (BAM)
by Kosko [K87]. This memory, shown in Figure 6.20, is able to associate pairs of the patterns a
and b.

This is a two-layer network with the output of the second layer connected directly to the input of
the first layer. The weight matrix of the second layer is W^T and it is W for the first layer. The
rectangular weight matrix W is obtained as the sum of the cross-correlation matrices

$$W = \sum_{m=1}^{M} \mathbf{a}_m \mathbf{b}_m^T \qquad (6.19)$$

where
M is the number of stored pairs
a_m and b_m are the stored vector pairs

The BAM concept can be extended for association of three or more vectors. 
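As a rough illustration of how a BAM associates pattern pairs, the sketch below builds W from (6.19) and passes a probe back and forth between the two layers until the pair stops changing. The bipolar sign activation, the stopping rule, and the example pairs are assumptions made for this sketch.

```python
def sign_keep(net, previous):
    """Bipolar sign; a zero net input keeps the previous state."""
    return previous if net == 0 else (1 if net > 0 else -1)


def bam_weights(pairs):
    """W = sum_m a_m b_m^T; rows are indexed by a, columns by b."""
    rows, cols = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * cols for _ in range(rows)]
    for a, b in pairs:
        for i in range(rows):
            for j in range(cols):
                w[i][j] += a[i] * b[j]
    return w


def bam_recall(w, a, max_iters=10):
    """Circulate a -> b -> a until the pair (a, b) is stable."""
    a = list(a)
    b = [1] * len(w[0])
    for _ in range(max_iters):
        # pass from the a side to the b side: b_j = sign(sum_i a_i w_ij)
        new_b = [sign_keep(sum(w[i][j] * a[i] for i in range(len(a))), b[j])
                 for j in range(len(b))]
        # pass back from the b side to the a side: a_i = sign(sum_j w_ij b_j)
        new_a = [sign_keep(sum(w[i][j] * new_b[j] for j in range(len(new_b))), a[i])
                 for i in range(len(a))]
        if new_a == a and new_b == b:
            break
        a, b = new_a, new_b
    return a, b


# Hypothetical pattern pairs
a1, b1 = [1, -1, 1, -1, 1, -1], [1, 1, -1, -1]
a2, b2 = [1, 1, 1, -1, -1, -1], [1, -1, 1, -1]
W_bam = bam_weights([(a1, b1), (a2, b2)])
noisy_a1 = [-1, -1, 1, -1, 1, -1]  # a1 with its first element flipped
recalled = bam_recall(W_bam, noisy_a1)
```

Presenting the corrupted version of a1 returns the stored pair (a1, b1), which shows the memory both completing the a pattern and retrieving its associated b pattern.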
7.1 Introduction


Neural networks are characterized by a massively parallel architecture and by the use of a learning
paradigm rather than explicit programming. These properties make neural networks very useful in several
areas where they can “learn from examples” and then process new data in a redundant way. In order to
take advantage of these features, novel silicon architectures must be conceived. While “computer simulation”
of neural networks brings interesting results, it fails to achieve the promise of natively parallel
implementations.


7.2 Radial-Basis-Function Networks 


Artificial neural networks process sets of data, classifying the sets according to similarities between
the sets. A very simple neural network is shown in Figure 7.1. There are more than a dozen different
architectures of such neural networks, as well as many ways to process the input data sets. The
two inputs in Figure 7.1 may or may not be weighted before reaching a node (the circle) for specific
processing. The radial basis function (RBF) represents one such processing method.

For an RBF neural network, the weights following the input layer are all set equal to one. After the
processing by $\varphi_1$ and $\varphi_2$, the results are weighted by $w_1$ and $w_2$ before entering the final summing node
in the output layer. Also note the fixed input, +1, with weight b, the use of which will become clear
below.


[Figure 7.1 labels: fixed input = +1; inputs $x_1$ and $x_2$; input layer, hidden layer, output layer]

FIGURE 7.1 A simple RBF network. Observe that in most cases other than for RBF networks, the inputs are
weighted before reaching the nodes in the hidden layer.


7.2.1 Radial Basis Function 


As indicated in Figure 7.1, the RBFs ($\varphi_1$ and $\varphi_2$) form the hidden layer of the neural network. We define the
RBF as a real-valued function of the form

$$\varphi(x) = \varphi(\|x\|) \tag{7.1}$$

It thus depends only on the distance from the origin, or from a point $m_i$,

$$\varphi(x, m_i) = \varphi(\|x - m_i\|) \tag{7.2}$$

The norm $\|\cdot\|$ is usually the Euclidean distance. An RBF is typically used to build up function approximations
like

$$y(x) = \sum_{i=1}^{N} w_i \varphi(\|x - m_i\|) \tag{7.3}$$


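The RBFs are each associated with a different center, $m_i$, and weighted by a coefficient, $w_i$.
Commonly used types of RBFs include ($r = \|x - m_i\|$)

Gaussian:
$$\varphi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right) \quad \text{for some } \sigma > 0 \text{ and } r \in \mathbb{R} \tag{7.4}$$
Multiquadrics:
$$\varphi(r) = \sqrt{r^2 + c^2} \quad \text{for some } c > 0 \text{ and } r \in \mathbb{R} \tag{7.5}$$
Inverse multiquadrics:
$$\varphi(r) = \frac{1}{\sqrt{r^2 + c^2}} \quad \text{for some } c > 0 \text{ and } r \in \mathbb{R} \tag{7.6}$$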
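The three radial functions and the weighted sum (7.3) are simple to express in code. The sketch below assumes the common textbook forms: a Gaussian $\exp(-r^2 / 2\sigma^2)$, a multiquadric $\sqrt{r^2 + c^2}$ and its reciprocal. The default parameter values and the function names are illustrative only.

```python
import math


def gaussian(r, sigma=1.0):
    """Gaussian RBF, assumed form exp(-r^2 / (2 sigma^2))."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def multiquadric(r, c=1.0):
    """Multiquadric RBF, sqrt(r^2 + c^2)."""
    return math.sqrt(r * r + c * c)


def inverse_multiquadric(r, c=1.0):
    """Inverse multiquadric RBF, 1 / sqrt(r^2 + c^2)."""
    return 1.0 / math.sqrt(r * r + c * c)


def rbf_sum(x, centers, weights, phi):
    """Approximation y(x) = sum_i w_i * phi(||x - m_i||) with the Euclidean norm."""
    return sum(w * phi(math.dist(x, m)) for w, m in zip(weights, centers))


# Example: two centers, evaluated at the first center
value = rbf_sum([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]], [1.0, 2.0], multiquadric)
```

Passing the radial function as an argument keeps the summation independent of the choice of RBF, which mirrors how (7.3) is written for a generic $\varphi$.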
7.2.2 Use of the RBF: An Example


A typical use of an RBF is to separate complex patterns, that is, patterns whose elements are not
linearly separable. The use of RBFs may solve the problem in one or, if needed, two steps:


a. Perform a nonlinear transformation, $\varphi_i(x)$, on the input vectors x, to a hidden space where the
$\varphi_i(x)$s may be linearly separable. The dimension of the hidden space is equal to the number of
nonlinear functions $\varphi_i(x)$ taking part in the transformation. The dimension of the hidden space
may, as a first try, be set equal to the dimension of the input space.

If the $\varphi_i(x)$s of the hidden space do not become linearly separable, then one may

b. Choose a higher dimension for the hidden space than that of the input space. Cover’s theorem
on the separability of patterns [19] suggests that this will increase the probability of achieving
linear separability of the elements in the hidden layer.


An RBF network then consists of an input layer and a hidden layer of nonlinear transfer functions, the
$\varphi_i(x)$s, whose outputs are the input to an ordinary feed-forward perceptron. Figure 7.1 shows a two-input RBF
network, with two nodes also in the hidden layer.

In the process of separating complex patterns, the use of an RBF network consists of choosing a
suitable radial function for the hidden layer, here $\varphi_1$ and $\varphi_2$, and then finding a working set of
weights, here $w_1$, $w_2$, and b.

Below is an example of this procedure applied to the so-called XOR problem [19]. In the XOR problem
there are four points or patterns in the two-dimensional $x_1$-$x_2$ plane: (1,1), (0,1), (0,0), and (1,0). The
requirement is to construct an RBF network that produces a binary output 0 in response to the input
pattern (1,1) or (0,0), and a binary output 1 to the two other input patterns (0,1) or (1,0).

In Figure 7.2, the four points are marked on the $x_1$-$x_2$ plane. One easily observes the impossibility of
using a straight line to divide the points according to the requirement.

In this example, the RBF used will be a Gaussian

$$G(\|x - t_i\|) = \exp\left(-\|x - t_i\|^2\right), \quad i = 1, 2$$

The Gaussian centers $t_1$ and $t_2$ are

$t_1 = [1, 1]$
$t_2 = [0, 0]$

The relationships between the inputs $x_j$ and the outputs $d_j$ ($j = 1, 2, 3, 4$) are

x_j       d_j
(1, 1)    0
(0, 1)    1
(0, 0)    0
(1, 0)    1

The outputs from the hidden layer will then be modified by the weights $w_1$ and $w_2$, such that

$$\sum_{i=1}^{2} w_i G(\|x_j - t_i\|) + b = d_j$$

where
i = 1, 2 refers to the two nodes ($\varphi_i$) in the hidden layer
j = 1, 2, 3, 4 refers to the four input patterns

FIGURE 7.2 The four patterns of the XOR problem.
The details of finding the weights are demonstrated in the MATLAB® program below.
In MATLAB notation, the input values x, the corresponding output values d, and the centre values t are


% The input values
x = [1 1; 0 1; 0 0; 1 0];
% The target values
d = [0 1 0 1]';
% Two neurons with Gaussian centres
t = [1 1; 0 0];


With the elements in the G-matrix written as

$$g_{ji} = G(\|x_j - t_i\|), \quad j = 1, 2, 3, 4; \quad i = 1, 2$$

which are the elements of a 4 by 2 matrix, one has in detailed MATLAB notation for the $g_{ji}$


% The Gaussian matrix (4 by 2)
for i = 1:2
    for j = 1:4
        a = (x(j,1) - t(i,1))^2 + (x(j,2) - t(i,2))^2;
        g(j,i) = exp(-a);
    end
end


with the result

g =
    1.0000    0.1353
    0.3679    0.3679
    0.1353    1.0000
    0.3679    0.3679

where
$g_{11}$ is the output from $\varphi_1$ (Figure 7.1) based on input $x_1 = (1,1)$
$g_{12}$ is the output from $\varphi_2$, based on the same input $x_1 = (1,1)$


Thus, the g-matrix displays the transformation from the input plane to the ($\varphi_1$, $\varphi_2$) plane. A simple plot
may show, by inspection, that it is possible to separate the two groups by a straight line:


for j = 1:4
    plot(g(j,1), g(j,2), 'or')
    hold on
end
axis([-0.2 1.2 -0.2 1.2])
hold off
grid


Now, before adjusting the $w_i$s, the bias b has to be included in the contribution to the output node:


% Prepare with bias
b = [1 1 1 1]';

so the new weight matrix is

% The new Gaussian
G = [g b]


with the result

G =
    1.0000    0.1353    1.0000
    0.3679    0.3679    1.0000
    0.1353    1.0000    1.0000
    0.3679    0.3679    1.0000
The perceptron part of the network, the output from the two RBFs together with the bias, may now be
presented as

$$G w = d$$

where $w = [w_1 \; w_2 \; b]^T$ is the weight vector, which is to be calculated, and

$$d = [0 \; 1 \; 0 \; 1]^T$$

is the desired output. T indicates the transpose. One observes that the matrix G is not square (an
overdetermined problem) and there is no unique inverse matrix $G^{-1}$. To overcome this, one may use the
pseudoinverse solution [19]:

$$w = G^{+} d = (G^T G)^{-1} G^T d$$


So, the following needs to be calculated in MATLAB:


% The transpose: G^T
gt = G';
% An inverse: (G^T G)^-1
gg = gt*G
gi = inv(gg);
% The new "Gaussian" G^+
gp = gi*gt;


which gives


gp =
    1.8296   -1.2513    0.6731   -1.2513
    0.6731   -1.2513    1.8296   -1.2513
   -0.9207    1.4207   -0.9207    1.4207


% The weights
w = gp*d

which gives

w =
   -2.5027
   -2.5027
    2.8413

and which may be checked

cd = G*w

which gives

cd =
   -0.0000
    1.0000
    0.0000
    1.0000


and is identical to the original desired output d.
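For readers without MATLAB, the same XOR calculation can be reproduced with the Python standard library. This is a sketch rather than a translation of the listing above: instead of forming the pseudoinverse explicitly, it solves the equivalent normal equations $(G^T G) w = G^T d$ with a small Gauss-Jordan routine. The helper name `solve` is hypothetical.

```python
import math


def solve(a, b):
    """Solve the square system a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


x = [(1, 1), (0, 1), (0, 0), (1, 0)]   # input patterns
d = [0, 1, 0, 1]                        # desired outputs
t = [(1, 1), (0, 0)]                    # Gaussian centers

# One row per pattern: [g_j1, g_j2, 1], the trailing 1 being the fixed bias input
G = [[math.exp(-math.dist(xj, ti) ** 2) for ti in t] + [1.0] for xj in x]

# Normal equations (G^T G) w = G^T d
GtG = [[sum(G[k][i] * G[k][j] for k in range(4)) for j in range(3)] for i in range(3)]
Gtd = [sum(G[k][i] * d[k] for k in range(4)) for i in range(3)]
w = solve(GtG, Gtd)                     # [w_1, w_2, b]

# Check: network output for each of the four patterns
cd = [sum(gk * wk for gk, wk in zip(row, w)) for row in G]
```

Because the second and fourth rows of G are identical and share the same target, the system is consistent, so the least-squares weights reproduce d exactly up to rounding.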


7.2.3 Radial-Basis-Function Network 


The approximation by the sum in (7.3) can also be interpreted as a rather simple single-layer
type of artificial neural network. Here, the RBFs take on the role of the activation functions of
the neural network. It can be shown that any continuous function on a compact interval can be interpolated
with arbitrary accuracy by a sum of this form, given that a sufficiently large number N of RBFs
is used.

The strong point of the RBF-net is its capability to separate entangled classes. While the RBF-net will
do quite well for the so-called double spiral (Figure 7.3), conventional nets will not do so well (as will be
shown later).


FIGURE 7.3 The twin spiral problem is hard to solve with a conventional feed-forward neural network but easily
solved for the RBF-net.

FIGURE 7.4 Results of tests using the double-spiral input. The results are shown for an RBF-DDA case (left),
MLP-PROP (middle), and a decision-tree type of neural network (right). (From Ref. [15].)


7.2.4 Learning Paradigms 


The most well-known learning paradigm for feed-forward neural networks is the backpropagation (BP)
paradigm. It has several salient pros and cons. For RBF networks, however, the restricted
Coulomb energy (RCE) algorithm or its probabilistic extension (RCE-P) is generally used. They have the feature that one does not need
to fix the number of hidden neurons beforehand. Instead, these are added during training “as needed.”
The big disadvantage with these paradigms is that the standard deviation is adjusted with one single
global parameter. The RCE training algorithm was introduced by Reilly, Cooper, and Elbaum (hence
the RCE [18]). The RCE and its probabilistic extension, the RCE-P algorithm, take advantage of a growing
structure in which hidden units, as mentioned, are only introduced when necessary. The nature of both
algorithms allows training to reach stability much faster than most other algorithms. But, again, RCE-P
networks do not adjust the standard deviation of their prototypes individually (using only one global
value for this parameter).

This latter disadvantage is taken care of in the Dynamic Decay Adjustment algorithm, or DDA
algorithm. It was introduced by Michael R. Berthold and Jay Diamond [13] and yields more efficient
networks at the cost of more computation. It is thus of importance to consider this when implementing
any neural network system and considering the task to solve. In a task where the groups are more
or less interlaced, the DDA is clearly superior, as shown in the “double-spiral test” in Figure 7.4. It could
also be mentioned that the DDA took only four epochs to train, while the network trained with RPROP was trained for
40,000 epochs, still with a worse result [15]. RPROP (short for resilient backpropagation [20]) is a learning
heuristic for supervised learning in artificial neural networks. It is similar
to the Manhattan update rule. (Cf., for example, Ref. [20] for details.) In spite of the superior results for the RBF-net, the multilayer
perceptrons (or MLPs for short) are still the most prominent and well-researched class of neural networks.


7.3 Implementation in Hardware 


Generally speaking, very few implementations of neural networks have been done outside the university
world, where such implementations have been quite popular. A few chips are or have been available, of
which the most well known are the Intel Electrically Trainable Neural Network (ETANN), which used a
feed-forward architecture, and the IBM ZISC [3,4], which used an RBF type of architecture as described
above. In 2008, an evolution of the ZISC, CogniMem (for cognitive memory), was introduced by
Recognetics Inc.

The ETANN chip had 64 neurons, which could be time multiplexed, that is, used two times. It was
possible to connect up to eight ETANN chips, which thus limited the number of neurons. The ETANN
was probably developed by Intel in collaboration with the U.S. Navy at China Lake and never entered
production. However, it was used in several applications in high-energy physics.

FIGURE 7.5 Comparison of two neural network chips, the ETANN and the ZISC, with the number of neurons
and connections in the brains of some animals.


The ZISC chip was developed by a team at IBM France, based on a hardware architecture devised by
Guy Paillet. This hardware architecture is at the intersection of parallel processing technology and the
RBF model (Figure 7.6).

Comparing the ETANN and ZISC to biological species may be wrong, but it nevertheless puts the silicon in
perspective relative to various animals (Figure 7.5).


FIGURE 7.6 Three ZISC chips were conveniently mounted on a PCMCIA card.


7.3.1 Implementation for High-Energy Physics


Although the number of cascaded-ZISC chips may be application or bus dependent, there should be no 
problem using up to 10 chips. Building larger networks is a matter of grouping small networks together 
by placing re-powering devices where needed [1]. Bus and CPU interface examples are found in the 
ZISC036 Data Book [1]. 

The VMEbus and VME modules are used by researchers in physics. A multipurpose, experimental
VMEbus card was used in a demonstration (cf. Figure 7.7 and also Ref. [5] for details on an IBM/ISA
implementation). The VME-board holds four piggy-back PCBs with one chip each (cf. Figure 7.7). The
PCBs, holding the ZISCs, are made to carry another card on top using Euro-connectors. Hence up to
40 ZISCs could, in principle, be mounted in four “ZISC-towers” on the VME-card.

In an early study of the ZISC036 using a PC/ISA-board [5], the computer codes were written in 
Borland C++ under DOS and Windows. In the VME implementation, we rely on a VME to SBus hard- 
ware interface and pertinent software. This software is written using the GNU C++ and the VMIC SBus 
interface library. 

As mentioned previously, a neural network of the RBF-type [9-11] is somewhat different from more
conventional NNW architectures. In very general terms, the approach is to map an N-dimensional space
by prototypes. Each of these prototypes is associated with a category and an influence field representing
a part of the N-dimensional space around the prototype. Input vectors within that field are assigned
the category of that prototype. (In the ZISC implementation, the influence fields are represented by
hyper-polygons rather than hyper-spheres as in a more theoretical model. Two user-selectable distance
norms are supported by the chip [1].) Several prototypes can be associated with one category and influence
fields may overlap.


FIGURE 7.7 Schematic layout of the VME/ZISC036 board. The lower part shows the piggy-back area, which can
hold 4-40 ZISC chips.

There are several learning algorithms associated with the RBF-architecture, but the most common
ones are the RCE [9] and RCE-like ones. The one used by the ZISC chip is “RCE-like.” A nearest-neighbor
evaluation is also available.
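The prototype and influence-field idea can be illustrated with a small software model. The sketch below is not the ZISC firmware or its API. It assumes a generic RCE-style rule (shrink a field that wrongly covers a training vector, and add a prototype when no field of the correct category covers it), uses two illustrative norms (Manhattan and maximum), and all names and the `max_radius` value are hypothetical.

```python
def l1(x, p):
    """Manhattan distance; its level sets are polygons rather than spheres."""
    return sum(abs(a - b) for a, b in zip(x, p))


def lsup(x, p):
    """Maximum (supremum) distance."""
    return max(abs(a - b) for a, b in zip(x, p))


def classify(x, prototypes, norm=l1):
    """Return the set of categories whose influence field contains x."""
    return {cat for center, radius, cat in prototypes if norm(x, center) < radius}


def nearest(x, prototypes, norm=l1):
    """Nearest-neighbor evaluation: category of the closest prototype."""
    return min(prototypes, key=lambda p: norm(x, p[0]))[2]


def learn(x, category, prototypes, norm=l1, max_radius=4.0):
    """One RCE-style training step on the list of (center, radius, category) tuples."""
    covered = False
    for i, (center, radius, cat) in enumerate(prototypes):
        dist = norm(x, center)
        if dist < radius:
            if cat == category:
                covered = True
            else:
                prototypes[i] = (center, dist, cat)  # shrink the conflicting field
    if not covered:
        # new prototype, limited so that it does not reach prototypes of other categories
        others = [norm(x, c) for c, _, cat in prototypes if cat != category]
        prototypes.append((tuple(x), min([max_radius] + others), category))


protos = []
learn((0, 0), "A", protos)
learn((3, 0), "B", protos)
```

After these two training vectors, a point such as (0, 1) falls only in the field of category A, while a point far from both prototypes is covered by no field and has to be resolved by the nearest-neighbor evaluation.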

We have compared the Intel ETANN and the Bellcore CLNN32/64 neural nets [12] to the ZISC036 with an
LHC physics “benchmark test.” The inputs in this test are the momenta and transverse momenta of the
four leading particles, obtained in a simulation of an LHC search for a heavy Higgs (cf. Ref. [12] for details).
Two-dimensional plots of these momenta ($p$ versus $p_t$), for the leading particle, are shown in Figure 7.8.

Although only some preliminary results have been obtained, it is fair to say that a system with eight
inputs and just 72 RBF-neurons could recognize the Higgs to a level of just above 70% and the background
to about 85%. This is almost as good as the CLNN32/64 chips discussed in Ref. [12]. Further
details and results will be presented at the AIHENP-95 [13].

In 2008, a powerful evolution of the ZISC, CM1K (CogniMem 1024 neurons), was released by
CogniMem Ltd. based on a new implementation by Anne Menendez and Guy Paillet. The neuron density
as well as additional features such as low power, high speed, and small size allow the CM1K to operate
both as a “near sensor trainable pattern recognition” device and as a pattern recognition server, taking
particular advantage of virtually unlimited expandability. A single CM1K chip achieves 2.6 billion CUPS
with a 256 kB storage capacity. An architecture featuring 1024 CM1K chips will reach 2.6 trillion CUPS
with 256 MB storage capacity. In practice, such a system will be able to find the nearest neighbor of one
256-byte vector among one million in less than 10 µs. Finding each subsequent closest neighbor will take
an additional 1.3 µs per neighbor vector.


[Figure 7.8 panel title: BACKGROUND]

FIGURE 7.8 Examples of inputs. The input neurons 1 and 2 get their values from either of the two histograms
shown here (there are three additional histograms for inputs 3 to 8).

FIGURE 7.9 Star tracker results as discussed in the text. The shaded area in the figure to the right indicates where
most star tracker data falls, showing that the RBF-net does very well when it comes to position jitter robustness.


7.3.2 Implementing a Star Tracker for Satellites 


A star tracker is an optical device measuring the direction to one or more stars, using a photocell or
solid-state camera to observe the star. One may use a single star; among the most used are Sirius (the
brightest) and Canopus. However, for more complex missions, entire star field databases are used to
identify orientation.

Using the ZISC chip to implement a star tracker turned out to be very successful [17]. The RMS
brightness jitter as well as the position jitter in pixels turned out to be equally good or even better than
those of most conventional star trackers of the same complexity. This is shown in Figure 7.9.


7.3.3 Implementing an RBF network in VHDL 


This may seem like a very good idea and has been done in several university projects [14]. Here, we simply
refer to one diploma report [16]. This report uses VHDL and Xilinx ISE 4.1i. It explains how
components were created in Xilinx and generated in CoreGenerator. It also contains VHDL code, a wiring
diagram, and an explanation of how to work with Xilinx in creating new projects and files. All solutions
are described and presented together with component diagrams to show what the various kinds of components
do and how they connect to each other.


8.1 Introduction


The scope of applications of mathematical models in contemporary industrial systems is very broad 
and includes system design [12], control [5,6,27], and diagnosis [15,16,19,22-24]. The models are usu- 
ally created on the basis of the physical laws describing the system behavior. Unfortunately, in the 
case of most industrial systems, these laws are too complex or unknown. Thus, the phenomenological 
models are often not available. In order to solve this problem, the system identification approach can 
be applied [22,25]. 

One of the most popular nonlinear system identification approaches is based on the application of 
artificial neural networks (ANNs) [3,11,21]. These can be most adequately characterized as computa- 
tional models with particular properties such as generalization abilities, the ability to learn, parallel data 
processing, and good approximation of nonlinear systems. However, ANNs, despite the small number 
of assumptions in comparison to analytical methods [4,9,15,22], still require a significant amount of a 
priori information about the model’s structure. Moreover, there are no efficient algorithms for selecting 
structures of classical ANNs, and hence many experiments should be carried out to obtain an appro- 
priate configuration. Experts should decide on the quality and quantity of inputs, the number of layers 
and neurons, as well as the form of their activation function. The heuristic approach that follows the 
determination of the network architecture corresponds to a subjective choice of the final model, which, 
in the majority of cases, will not approximate with the required quality. 

To tackle this problem, the Group Method of Data Handling (GMDH) approach can be employed 
[14,19]. The concept of this approach is based on iterative processing of an operation defined as a sequence 
leading to the evolution of the resulting neural network structure. The GMDH approach also allows devel- 
oping the formula of the GMDH model due to inclusion of the additional procedures, which can be used 
to extend the scope of the application. GMDH neural models can be used in the identification of static 
and dynamic systems, both single-input single-output (SISO) and multi-input multi-output (MIMO). 


The main objective of this chapter is to present the structure and properties of GMDH neural networks.
In particular, the chapter is organized as follows: Section 8.2 outlines the fundamentals of
GMDH neural networks and presents the processes of the synthesis of structure and parameter estimation
of the GMDH network. Section 8.3 is devoted to generalizations of the GMDH approach. In particular, the
problem of the modeling of dynamical systems is considered. Moreover, the processes of the synthesis of
single-input and multi-output GMDH neural networks are described. The subsequent Section 8.4 shows
an application of the GMDH neural model to robust fault detection. Finally, Section 8.5 concludes
the chapter.


8.2 Fundamentals of GMDH Neural Networks 


The concept of the GMDH approach relies on replacing the complex neural model by the set of hierar- 
chically connected partial models. The model is obtained as a result of neural network structure synthe- 
sis with the application of the GMDH algorithm [7,14]. The synthesis process consists of partial model 
structure selection and parameter estimation. The parameters of each partial model (a neuron) are esti- 
mated separately. In the next step of process synthesis, the partial models are evaluated, selected, and 
included to the newly created neuron layers (Figure 8.1). During the network synthesis, new layers are 
added to the network. The process of network synthesis leads to the evolution of the resulting model 
structure to obtain the best quality approximation of real system output signals. The process is com- 
pleted when the optimal degree of network complexity is achieved. 


8.2.1 Synthesis of the GMDH Neural Network 


Based on the kth measurement of the system inputs $u(k) \in \mathbb{R}^{n_u}$, the GMDH network grows its first layer
of neurons. It is assumed that all the possible couples of inputs from $u_1^{(l)}(k), \ldots, u_{n_u}^{(l)}(k)$, belonging to the
training data set T, constitute the stimulation, which results in the formation of the neuron outputs
$\hat{y}_n^{(l)}(k)$:

$$\hat{y}_n^{(l)}(k) = f(u(k)) = f\left(u_1^{(l)}(k), \ldots, u_{n_u}^{(l)}(k)\right) \tag{8.1}$$

where
l is the layer number of the GMDH network
n is the neuron number in the lth layer

FIGURE 8.1 Synthesis of the GMDH-type neural network.

The GMDH approach allows much freedom in defining an elementary model transfer function f
(e.g., tangent or logarithmic functions) [14,20]. The original GMDH algorithm developed by Ivakhnenko
[13] is based on linear or second-order polynomial transfer functions, such as

$$f(u_i(k), u_j(k)) = p_0 + p_1 u_i(k) + p_2 u_j(k) + p_3 u_i(k) u_j(k) + p_4 u_i^2(k) + p_5 u_j^2(k). \tag{8.2}$$


In this case, after network synthesis, the general relation between the model inputs and the output $\hat{y}(k)$
can be described in the following way:

$$\hat{y}(k) = f(u(k)) = p_0 + \sum_{i=1}^{n_u} p_i u_i(k) + \sum_{i=1}^{n_u} \sum_{j=1}^{n_u} p_{ij} u_i(k) u_j(k) + \cdots \tag{8.3}$$

From the practical point of view, (8.3) should not be too complex because it may complicate the learning
process and extend the computation time. In general, in the case of the identification of static nonlinear
systems, the partial model can be described as follows:

$$\hat{y}_n^{(l)}(k) = \xi\left(\left(r_n^{(l)}(k)\right)^T p_n^{(l)}\right) \tag{8.4}$$

where $\xi(\cdot)$ denotes a nonlinear invertible activation function, that is, there exists $\xi^{-1}(\cdot)$. Moreover,
$r_n^{(l)}(k) = f\left([u_i^{(l)}(k), u_j^{(l)}(k)]^T\right)$, $i, j = 1, \ldots, n_u$, and $p_n^{(l)} \in \mathbb{R}^{n_p}$ are the regressor and parameter vectors, respectively,
and $f(\cdot)$ is an arbitrary bivariate vector function, for example, $f(x) = [x_1^2, x_2^2, x_1 x_2, x_1, x_2, 1]^T$, that
corresponds to the bivariate polynomial of the second degree.
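A single partial model (neuron) of this kind is easy to write down in code. The sketch below assumes the partial model has the form $\hat{y} = \xi(r^T p)$ with the second-degree regressor given in the example above. The choice of `tanh` as the activation and the function names are illustrative assumptions.

```python
import math


def regressor(ui, uj):
    """r = f([u_i, u_j]) = [u_i^2, u_j^2, u_i*u_j, u_i, u_j, 1]."""
    return [ui * ui, uj * uj, ui * uj, ui, uj, 1.0]


def neuron_output(ui, uj, p, activation=math.tanh):
    """Partial model output: activation applied to the inner product r^T p."""
    r = regressor(ui, uj)
    return activation(sum(rk * pk for rk, pk in zip(r, p)))


# With an identity activation the neuron reduces to the plain polynomial
linear_value = neuron_output(2.0, 3.0, [0, 0, 0, 1, 1, 0], activation=lambda z: z)
```

Keeping the regressor separate from the activation makes the model linear in the parameters once the inverse activation is applied, which is the property exploited by the parameter estimation discussed later in the chapter.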


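The number of neurons in the first layer of the GMDH network depends on the number of the
external inputs $n_u$:

$$\hat{y}_1^{(1)}(k) = f\left(u_1^{(1)}(k), u_2^{(1)}(k), \hat{p}_{1,2}\right)$$
$$\hat{y}_2^{(1)}(k) = f\left(u_1^{(1)}(k), u_3^{(1)}(k), \hat{p}_{1,3}\right)$$
$$\vdots$$
$$\hat{y}_{n_y}^{(1)}(k) = f\left(u_{n_u-1}^{(1)}(k), u_{n_u}^{(1)}(k), \hat{p}_{n_u-1,n_u}\right)$$

where $\hat{p}_{1,2}, \hat{p}_{1,3}, \ldots, \hat{p}_{n_u-1,n_u}$ are estimates of the network parameters and should be obtained during the
identification process.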
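The way the first layer grows with the number of inputs can be made concrete by enumerating the input couples. The short sketch below assumes that each unordered pair of distinct inputs feeds exactly one neuron, which gives $n_u(n_u - 1)/2$ neurons. The function name is hypothetical.

```python
from itertools import combinations


def first_layer_pairs(n_u):
    """All couples (i, j), i < j, of input indices 1..n_u; one neuron per couple."""
    return list(combinations(range(1, n_u + 1), 2))


pairs = first_layer_pairs(4)
```

For four inputs this yields six couples, starting with (1, 2) and (1, 3) and ending with (3, 4), in the same order as the neuron equations listed above.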

At the next stage of GMDH network synthesis, a validation data set V, not employed during the
parameter estimation phase, is used to calculate a processing error of each partial model in the current
lth network layer. The processing error can be calculated with the application of an evaluation criterion
such as the final prediction error (FPE), the Akaike information criterion (AIC), or the F-test. Based
on the defined evaluation criterion, it is possible to select the best-fitted neurons in the layer. Selection
methods in GMDH neural networks play the role of a mechanism of structural optimization at the stage
of constructing a new layer of neurons. During the selection, neurons that have too large a value of the
evaluation criterion $Q(\hat{y}_n^{(l)}(k))$ are rejected.

A few methods of performing the selection procedure can be applied [20]. One of the most often
used is the constant population method. It is based on a selection of g neurons whose evaluation
criterion $Q(\hat{y}_n^{(l)}(k))$ reaches the least values. The constant g is chosen in an empirical way and the
most important advantage of this method is its simplicity of implementation. Unfortunately, the constant
population method has very restrictive structure evolution possibilities. One way out of this problem
is the application of the optimal population method. This approach is based on rejecting the neurons
whose value of the evaluation criterion is larger than an arbitrarily determined threshold $e_h$. The
threshold is usually selected for each layer in an empirical way depending on the task considered. The
difficulty with the selection of the threshold means that the optimal population method is
not applied very often. One of the most interesting ways of performing the selection procedure is the
application of the method based on the soft selection approach. An outline of the soft selection method
[18] is as follows:


Input: The set of all $n_y$ neurons in the lth layer; $n_j$, the number of opponent neurons; $n_w$, the number of
winnings required for nth neuron selection.


Output: The set of neurons after selection.


1. Calculate the evaluation criterion $Q(\hat{y}_n^{(l)}(k))$ for $n = 1, \ldots, n_y$ neurons.

2. Conduct a series of $n_y$ competitions between each nth neuron in the layer and $n_j$ randomly selected
neurons (the so-called opponents) from the same layer. The nth neuron is the so-called winner
neuron when

$$Q(\hat{y}_n^{(l)}(k)) \le Q(\hat{y}_j^{(l)}(k)), \quad j = 1, \ldots, n_j,$$

where $\hat{y}_j^{(l)}(k)$ denotes a signal generated by the opponent neuron.
3. Select the neurons for the (l + 1)-th layer with the number of winnings bigger than $n_w$ (the remaining
neurons are removed).


The property of soft selection follows from the specific series of competitions. It may happen that the
potentially unfitted neuron is selected. Everything depends on its score in the series of competitions.
The main advantage of such an approach in comparison with other selection methods is that it is possible
to use potentially unfitted neurons, which in the next layers may improve the quality of the model.
Moreover, if the neural network is not fitted perfectly to the identification data set, it is possible to
achieve a network that possesses better generalization abilities. One of the most important parameters
that should be chosen in the selection process is the number $n_j$ of opponents. A bigger value of $n_j$ makes
the probability of the selection of a neuron with a small quality index lower. In this way, in an extreme
situation, when $n_j \gg n_y$, the soft selection method will behave as the constant population method, which
is based on the selection only of the best-fitted neurons. Some experimental results performed on a
number of selected examples indicate that the soft selection method makes it possible to obtain a more
flexible network structure. Another advantage, in comparison to the optimal population method, is that
an arbitrary selection of the threshold is avoided. Instead, we have to select a number of winnings $n_w$.
It is, of course, a less sophisticated task.
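The soft selection outline can be turned into a short routine. The sketch below follows one reading of the outline, stated here as an assumption: each neuron is compared with $n_j$ randomly drawn opponents from the same layer, every opponent it matches or beats counts as one winning, and the neuron is kept when its winnings exceed $n_w$. Function and parameter names are hypothetical.

```python
import random


def soft_selection(q, n_j, n_w, rng=None):
    """Return indices of the neurons kept for the next layer.

    q   : evaluation criterion of each neuron in the layer (lower is better)
    n_j : number of randomly drawn opponents per neuron
    n_w : a neuron is kept when its number of winnings is bigger than n_w
    """
    rng = rng or random.Random()
    selected = []
    for n, q_n in enumerate(q):
        others = [j for j in range(len(q)) if j != n]
        opponents = rng.sample(others, min(n_j, len(others)))
        winnings = sum(1 for j in opponents if q_n <= q[j])
        if winnings > n_w:
            selected.append(n)
    return selected


q_values = [0.1, 0.5, 0.9, 0.3]
kept_all_opponents = soft_selection(q_values, n_j=3, n_w=1)
kept_one_opponent = soft_selection(q_values, n_j=1, n_w=0, rng=random.Random(0))
```

When every other neuron is used as an opponent the result is deterministic and only the better-fitted neurons survive. With a single random opponent, a mediocre neuron can survive by chance, which is the "soft" behavior described in the text.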

After the selection procedure, the outputs of the selected neurons become the inputs to other neurons
in the next layer:

$$u_1^{(l+1)}(k) = \hat{y}_1^{(l)}(k),$$
$$u_2^{(l+1)}(k) = \hat{y}_2^{(l)}(k),$$
$$\vdots \tag{8.5}$$
$$u_{n_u}^{(l+1)}(k) = \hat{y}_{n_y}^{(l)}(k).$$


In an analogous way, the new neurons in the next layers of the network are created. During the
synthesis of the GMDH network, the number of layers suitably increases. Each time a new layer is
added, new neurons are introduced. The synthesis of the GMDH network is completed when the
optimum criterion is achieved. The idea of this criterion relies on the determination of the quality
index $Q(\hat{y}_n^{(l)}(k))$ for all N neurons included in the lth layer. $Q_{min}^{(l)}$ represents the processing error for the
best neuron in this layer:

$$Q_{min}^{(l)} = \min_{n = 1, \ldots, N} Q(\hat{y}_n^{(l)}(k)). \tag{8.6}$$

The values $Q(\hat{y}_n^{(l)}(k))$ can be determined with the application of the defined evaluation criterion, which
was used in the selection process. The values $Q_{min}^{(l)}$ are calculated for each layer in the network. The optimum
criterion is achieved when the following condition occurs:

$$Q_{opt}^{(L)} = \min_{l = 1, \ldots, L} Q_{min}^{(l)}. \tag{8.7}$$

$Q_{opt}^{(L)}$ represents the processing error for the best neuron in the network, which generates the model
output. In other words, when additional layers do not improve the performance of the network, the
synthesis process is stopped.
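The stopping rule amounts to tracking the best criterion value per layer. The sketch below assumes that the per-layer criterion values are already available as plain lists and that synthesis stops as soon as a new layer fails to improve on the best earlier layer. The function names and the example numbers are hypothetical.

```python
def best_layer(q_layers):
    """Return (1-based index of the optimal layer, Q_opt) from per-layer criterion values."""
    q_min = [min(layer) for layer in q_layers]   # best neuron of each layer
    q_opt = min(q_min)                            # best neuron of the whole network
    return q_min.index(q_opt) + 1, q_opt


def should_stop(q_min_history):
    """Stop when the newest layer does not improve on the best earlier layer."""
    return len(q_min_history) >= 2 and q_min_history[-1] >= min(q_min_history[:-1])


# Hypothetical criterion values for three layers
layers = [[0.9, 0.5, 0.7], [0.4, 0.45], [0.42, 0.6]]
optimal = best_layer(layers)
```

In this example the third layer is worse than the second, so the synthesis would stop and the best neuron of the second layer would generate the model output.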

To obtain the final structure of the network (Figure 8.2), all unnecessary neurons are removed, leav- 
ing only those that are relevant to the computation of the model output. The procedure of removing 
unnecessary neurons is the last stage of the synthesis of the GMDH neural network. 


8.2.2 Parameters Estimation of the GMDH Neural Network 


The application of the GMDH approach during neural network synthesis allows us to apply parameter
estimation algorithms for linear-in-parameters models, for example, the least mean square (LMS) [7,14].
This follows from the fact that the parameters of each partial model are estimated separately and the
neuron’s activation function $\xi(\cdot)$ fulfills the following conditions:


1. $\xi(\cdot)$ is continuous and bounded, that is, $\forall u \in \mathbb{R}: a < \xi(u) < b$.
2. $\xi(\cdot)$ is monotonically increasing, that is, $\forall u, y \in \mathbb{R}: u < y$ iff $\xi(u) < \xi(y)$.
3. $\xi(\cdot)$ is invertible, that is, there exists $\xi^{-1}(\cdot)$.
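Because the activation is invertible, applying its inverse to the measured output turns the neuron into a model that is linear in its parameters, which can then be fitted by ordinary least squares. The sketch below illustrates this under stated assumptions: a `tanh` activation, a linear regressor $[u_i, u_j, 1]$ to keep the system small, noise-free synthetic data, and a hypothetical `solve` helper for the normal equations.

```python
import math


def solve(a, b):
    """Solve the square system a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate(samples, inverse_activation=math.atanh):
    """Least-squares estimate of p in xi^{-1}(y) = r^T p from samples (u_i, u_j, y)."""
    rows = [[ui, uj, 1.0] for ui, uj, _ in samples]          # linear regressor
    z = [inverse_activation(y) for _, _, y in samples]       # linearized outputs
    n = len(rows[0])
    rtr = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    rtz = [sum(r[i] * zk for r, zk in zip(rows, z)) for i in range(n)]
    return solve(rtr, rtz)


# Synthetic data generated from known parameters
p_true = [0.5, -0.3, 0.1]
inputs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
samples = [(ui, uj, math.tanh(0.5 * ui - 0.3 * uj + 0.1)) for ui, uj in inputs]
p_hat = estimate(samples)
```

With noise-free data the estimate reproduces the generating parameters to within rounding error. With real measurements the boundedness condition matters, since the inverse activation is only defined for outputs strictly inside the range of $\xi$.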


The advantage of the LMS approach is the simple computation algorithm that gives good results even
for small sets of measuring data. Unfortunately, the usual statistical parameter estimation framework
assumes that the data are corrupted by errors, which can be modeled as realizations of independent
random variables with a known or parameterized distribution. A more realistic approach is to assume
that the errors lie between given prior bounds. It leads directly to the bounded error set estimation class
of algorithms, and one of them, called the outer bounding ellipsoid (OBE) algorithm [17,26], can be
employed to solve the parameter estimation problem considered.

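The OBE algorithm requires the system output to be described in the form


$$y_n^{(l)}(k) = \bigl(r_n^{(l)}(k)\bigr)^T p_n^{(l)} + \varepsilon_n^{(l)}(k). \tag{8.8}$$

Moreover, it is assumed that the output error, $\varepsilon_n^{(l)}(k)$, can be defined as

$$\varepsilon_n^{(l)}(k) = y_n^{(l)}(k) - \hat{y}_n^{(l)}(k), \tag{8.9}$$


where
$y_n^{(l)}(k)$ is the $k$th scalar measurement of the system output
$\hat{y}_n^{(l)}(k)$ is the corresponding neuron output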


The problem is to estimate the parameter vector $p_n^{(l)}$, that is, to obtain $\hat{p}_n^{(l)}$, as well as an associated
parameter uncertainty in the form of the admissible parameter space $\mathbb{E}$. In order to simplify the
notation, the indices $(l)$ and $n$ are omitted. As already mentioned, it is possible to assume that $\varepsilon(k)$ lies
between given prior bounds. In this case, the output error is assumed to be bounded as follows:


$$\varepsilon^{m}(k) \le \varepsilon(k) \le \varepsilon^{M}(k), \tag{8.10}$$


where the bounds $\varepsilon^{m}(k)$ and $\varepsilon^{M}(k)$ ($\varepsilon^{m}(k) \ne \varepsilon^{M}(k)$) can be estimated [29] or are known a priori [17]. An
example can be provided by data collected with an analogue-to-digital converter or by measurements
performed with a sensor of a given type. Based on the measurements $\{r(k), y(k)\}$, $k = 1, \dots, n_T$, and the
error bounds (8.10), a finite number of linear inequalities is defined. Each inequality associated with the
$k$th measurement can be put in the following standard form:


$$-1 \le \bar{y}(k) - \bar{\hat{y}}(k) \le 1, \tag{8.11}$$

where


$$\bar{y}(k) = \frac{2y(k) - \varepsilon^{M}(k) - \varepsilon^{m}(k)}{\varepsilon^{M}(k) - \varepsilon^{m}(k)}, \tag{8.12}$$

$$\bar{\hat{y}}(k) = \frac{2\hat{y}(k)}{\varepsilon^{M}(k) - \varepsilon^{m}(k)}. \tag{8.13}$$

The inequalities (8.11) define two parallel hyperplanes for each $k$th measurement:

$$H^{+} = \{p \in \mathbb{R}^{n_p} : \bar{y}(k) - r^T(k-1)p = 1\}, \tag{8.14}$$

$$H^{-} = \{p \in \mathbb{R}^{n_p} : \bar{y}(k) - r^T(k-1)p = -1\}, \tag{8.15}$$

which bound a strip $S(k)$ containing the set of $p$ values that satisfy the constraints with $y(k)$:

$$S(k) = \{p \in \mathbb{R}^{n_p} : -1 \le \bar{y}(k) - \bar{\hat{y}}(k) \le 1\}. \tag{8.16}$$
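The standard form (8.11) is obtained from the bounds (8.10) by centering and scaling. A short derivation is sketched below, assuming only that $\varepsilon^{M}(k) > \varepsilon^{m}(k)$ and writing $\hat{y}(k)$ for the neuron output:

```latex
\begin{align*}
\varepsilon^{m}(k) \le y(k) - \hat{y}(k) \le \varepsilon^{M}(k)
&\iff
-\tfrac{1}{2}\bigl(\varepsilon^{M}(k) - \varepsilon^{m}(k)\bigr)
\le y(k) - \hat{y}(k) - \tfrac{1}{2}\bigl(\varepsilon^{M}(k) + \varepsilon^{m}(k)\bigr)
\le \tfrac{1}{2}\bigl(\varepsilon^{M}(k) - \varepsilon^{m}(k)\bigr) \\
&\iff
-1 \le
\frac{2y(k) - \varepsilon^{M}(k) - \varepsilon^{m}(k)}{\varepsilon^{M}(k) - \varepsilon^{m}(k)}
- \frac{2\hat{y}(k)}{\varepsilon^{M}(k) - \varepsilon^{m}(k)}
\le 1 .
\end{align*}
```

The first step subtracts the midpoint of the error interval, and the second divides by its half-width, which is positive under the stated assumption.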
FIGURE 8.3 Recursive determination of the outer ellipsoid.


By intersecting the strips $S(k)$ for the $k = 1, \dots, n_T$ measurements, a feasible region of the parameters
$\mathbb{E}$ is obtained, and its center is chosen as the parameter estimate. Unfortunately, the polytopic region $\mathbb{E}$
becomes very complicated when the number of measurements and parameters is significant, and hence its
determination is time consuming. An easier solution relies on the approximation of the convex polytopes
$S(k)$ by simpler ellipsoids. In a recursive OBE algorithm, which is based on this idea, the measurements
are taken into account one after another to construct a succession of ellipsoids containing all values of
$p$ consistent with all previous measurements. After the first $k$ observations, the set of feasible parameters
is characterized by the ellipsoid:


$$\mathbb{E}(\hat{p}(k), P(k)) = \{p \in \mathbb{R}^{n_p} : (p - \hat{p}(k))^T P^{-1}(k) (p - \hat{p}(k)) \le 1\}, \tag{8.17}$$


where
$\hat{p}(k)$ is the center of the ellipsoid constituting the $k$th parameter estimate
$P(k)$ is a positive-definite matrix that specifies its size and orientation


By means of the intersection of the strip (8.16) and the ellipsoid (8.17), a region of possible parameter
estimates is obtained. This region is outer bounded by a new ellipsoid $\mathbb{E}(k + 1)$. The OBE algorithm
provides rules for computing $\hat{p}(k)$ and $P(k)$ in such a way that the volume of $\mathbb{E}(\hat{p}(k + 1), P(k + 1))$ is
minimized (cf. Figure 8.3). The center of the last, $n_T$th, ellipsoid constitutes the resulting parameter
estimate, while the ellipsoid itself represents the feasible parameter set. However, any parameter vector
$\hat{p}$ contained in $\mathbb{E}(n_T)$ is a valid estimate of $p$. A detailed structure of the OBE recursive algorithm is
described in [26].
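The volume-minimizing update rules are given in [26] and are not reproduced here. As a rough illustration of the recursion only, the sketch below combines the current ellipsoid with one strip using a fixed weight `lam`, which yields a valid outer ellipsoid that is generally not the minimum-volume one. The function name `obe_step`, the weight `lam`, and the assumption that the strip is written as $|\bar{y} - r^T p| \le 1$ with an already normalized regressor `r` are choices made for this example, not part of the algorithm in [26].

```python
def obe_step(center, shape, r, y_bar, lam=0.5):
    """One outer-bounding update (illustrative, fixed weight, not volume-optimal).

    center, shape : current ellipsoid {p : (p - c)^T P^{-1} (p - c) <= 1}
    r, y_bar      : strip {p : |y_bar - r^T p| <= 1} (normalized form, assumed)
    lam           : weight in (0, 1) given to the strip constraint (hypothetical knob)
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    n = len(center)
    # P r and the scalar r^T P r
    Pr = [sum(shape[i][j] * r[j] for j in range(n)) for i in range(n)]
    g = sum(r[i] * Pr[i] for i in range(n))
    # prediction error of the current center
    e = y_bar - sum(r[i] * center[i] for i in range(n))
    d = (1.0 - lam) + lam * g
    # minimum of the weighted sum of both quadratic constraints
    delta = lam * (1.0 - lam) * e * e / d
    if delta > 1.0:
        raise ValueError("strip and ellipsoid do not intersect")
    new_center = [center[i] + lam * Pr[i] * e / d for i in range(n)]
    scale = (1.0 - delta) / (1.0 - lam)
    new_shape = [
        [scale * (shape[i][j] - lam * Pr[i] * Pr[j] / d) for j in range(n)]
        for i in range(n)
    ]
    return new_center, new_shape


# Circle of radius 2 around the origin, cut by the strip -0.5 <= p_1 <= 1.5.
c, P = obe_step([0.0, 0.0], [[4.0, 0.0], [0.0, 4.0]], r=[1.0, 0.0], y_bar=0.5)
```

Every point satisfying both original constraints also satisfies their weighted sum, so the returned ellipsoid contains the intersection for any admissible `lam`. Choosing `lam` to minimize the volume at each step is what distinguishes the algorithm referenced in the text from this sketch.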


8.3 Generalizations of the GMDH Algorithm 


The assumptions of GMDH networks presented in Section 8.2 leave considerable freedom in defining the
particular elements of the synthesis algorithm, for example, the transition function, the evaluation
criteria of the processing accuracy, or the selection methods. The GMDH concept also allows the
network to be developed further through additional procedures that extend the scope of its
application.


8.3.1 Dynamics in GMDH Neural Networks


The partial model described in Section 8.2.1 can be used for the identification of nonlinear static systems.
Unfortunately, most industrial systems are dynamic in nature [15,22]. The application of a static neural
network will result in a large model uncertainty. Thus, during system identification, it seems desirable
to employ models that can represent the dynamics of the system. In the case of a classical neural
network, for example, the multi-layer perceptron (MLP), the problem of modeling the dynamics is solved
by the introduction of additional inputs. The input vector consists of suitably delayed inputs and outputs:


$$\hat{y}(k) = f\bigl(u(k), u(k-1), \dots, u(k-n_u), \hat{y}(k-1), \dots, \hat{y}(k-n_y)\bigr), \tag{8.18}$$


where $n_u$ and $n_y$ represent the number of delays. Unfortunately, the described approach cannot be
applied in the GMDH neural network easily, because such a network is constructed through gradual
connection of the partial models. The introduction of global output feedback lines complicates the
synthesis of the network. On the other hand, the behavior of each partial model should reflect the behavior
of the identified system. It follows from the rule of the GMDH algorithm that the parameters of each
partial model are estimated in such a way that their output is the best approximation of the real system
output. In this situation, the partial model should have the ability to represent the dynamics. One way
out of this problem is to use dynamic neurons [16].
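The delayed input vector of (8.18) can be sketched as follows. This is a minimal illustration, assuming the signals are stored as Python lists indexed by the discrete time $k$; the function name `narx_regressor` is hypothetical.

```python
def narx_regressor(u, y_hat, k, n_u, n_y):
    """Return [u(k), ..., u(k - n_u), y_hat(k - 1), ..., y_hat(k - n_y)]."""
    if k < max(n_u, n_y):
        raise ValueError("not enough past samples for the requested delays")
    delayed_inputs = [u[k - i] for i in range(n_u + 1)]
    delayed_outputs = [y_hat[k - j] for j in range(1, n_y + 1)]
    return delayed_inputs + delayed_outputs


# Two input delays and one output delay at time k = 3.
print(narx_regressor([0, 1, 2, 3, 4], [10, 11, 12, 13, 14], k=3, n_u=2, n_y=1))
# prints [3, 2, 1, 12]
```

The delayed outputs make the model depend on its own earlier predictions, which is the global feedback that is awkward to introduce in a network built layer by layer.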

By introducing different local feedbacks into the classical neuron model, it is possible to
obtain several types of dynamic neurons. The best-known architectures are the so-called neurons
with local activation feedback [8], neurons with local synapse feedback [1], and neurons with output
feedback [10]. The main advantage of networks constructed with dynamic neurons
is that their stability can be proved relatively easily. As a matter of fact, the stability of the
network depends only on the stability of the neurons. The feed-forward structure of such networks seems to
make the training process easier. On the other hand, the introduction of dynamic neurons increases the
parameter space significantly. This drawback, together with the nonlinear and multi-modal properties of
the identification index, implies that parameter estimation becomes relatively complex.

In order to overcome this drawback, it is possible to use another type of dynamic neuron model [16].
The dynamics in this approach are realized by the introduction of a linear dynamic system, namely an
infinite impulse response (IIR) filter. In this way, each neuron in the network reproduces the output signal
based on the past values of its inputs and outputs. Such a neuron model (Figure 8.4) consists of two
submodules: the filter module and the activation module.

The filter module is described by the following equation:


$$\tilde{y}_n^{(l)}(k) = \bigl(r_n^{(l)}(k)\bigr)^T p_n^{(l)}, \tag{8.19}$$

where
$r_n^{(l)}(k) = [-\tilde{y}_n^{(l)}(k-1), \dots, -\tilde{y}_n^{(l)}(k-n_a), u_n^{(l)}(k), u_n^{(l)}(k-1), \dots, u_n^{(l)}(k-n_b)]$ is the regressor
$p_n^{(l)} = [a_1, \dots, a_{n_a}, v_0, v_1, \dots, v_{n_b}]$ is the vector of filter parameters


The filter output is used as the input for the activation module:

$$\hat{y}_n^{(l)}(k) = \xi\bigl(\tilde{y}_n^{(l)}(k)\bigr). \tag{8.20}$$

The application of dynamic neurons in the process of GMDH network synthesis can improve the model
quality. To additionally reduce the uncertainty of the dynamic neural model, it is necessary to assume
the appropriate order of the IIR filter. This problem can be solved by the application of the Lipschitz
index approach based on the so-called Lipschitz quotients [21].
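The two submodules can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: a single input signal, zero initial conditions for samples before $k = 0$, and the hyperbolic tangent as the activation function. The function name `dynamic_neuron` and these choices are not taken from the source.

```python
import math


def dynamic_neuron(u, a, v, activation=math.tanh):
    """Simulate an IIR filter (8.19) followed by an activation function (8.20).

    u : input samples u(0), u(1), ...
    a : feedback parameters a_1, ..., a_na
    v : feed-forward parameters v_0, ..., v_nb
    """
    y_tilde = []  # filter outputs
    y_hat = []    # neuron outputs
    for k in range(len(u)):
        acc = 0.0
        # feedback part of the regressor: -y_tilde(k - i) * a_i
        for i, a_i in enumerate(a, start=1):
            if k - i >= 0:
                acc += -y_tilde[k - i] * a_i
        # feed-forward part of the regressor: u(k - j) * v_j
        for j, v_j in enumerate(v):
            if k - j >= 0:
                acc += u[k - j] * v_j
        y_tilde.append(acc)
        y_hat.append(activation(acc))
    return y_tilde, y_hat


# Impulse response of a first-order filter with a_1 = -0.5 and v_0 = 1.
filter_out, neuron_out = dynamic_neuron([1.0, 0.0, 0.0], a=[-0.5], v=[1.0])
print(filter_out)  # [1.0, 0.5, 0.25]
```

Because the feedback acts on the filter output before the activation, the filter remains linear in its parameters, which is the property exploited by the estimation methods of Section 8.2.2.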
FIGURE 8.4 Dynamic neuron model.


8.3.2 Synthesis of the Single-Input Dynamic GMDH Neural Network 


The main advantage of the GMDH neural network is its applicability to systems with a large number of
inputs. Unfortunately, the method has a weakness: network synthesis cannot be carried out when the
number of inputs is less than three. Such a situation takes place, for example, during the identification
of SISO systems.

To overcome this problem, it is possible to decompose IIR filters in the way shown in Figure 8.5.
A network of $n_u$ inputs is built from $p$-input neurons with IIR filters. $N$ new elements are formed in the
following way:


$$N = \frac{(n_u + n_d n_u)!}{p!\,(n_u + n_d n_u - p)!}. \tag{8.21}$$


FIGURE 8.5 Decomposition of IIR filters.


As can be observed, as a result of such a decomposition, a greater number of inputs can be obtained in
the next layer of the network. The further process of network synthesis is carried out in
the manner described in Section 8.2.
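The count in (8.21) is a binomial coefficient over all current and delayed input signals. A small sketch, assuming that $n_d$ denotes the number of delays applied to each input (the text does not define it explicitly) and using the hypothetical name `candidate_neurons`:

```python
from math import comb


def candidate_neurons(n_u, n_d, p=2):
    """Number of p-input neurons that can be formed from n_u inputs with n_d delays each."""
    signals = n_u + n_d * n_u  # every input and each of its delayed copies
    return comb(signals, p)


print(candidate_neurons(n_u=1, n_d=0))  # 0: a single signal cannot feed a two-input neuron
print(candidate_neurons(n_u=1, n_d=3))  # 6: four signals, taken two at a time
```

The first call shows the difficulty with SISO systems, and the second shows how treating delayed copies of the input as separate signals makes the synthesis possible.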


8.3.3 Synthesis of the Multi-Output GMDH Neural Network 


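The assumptions of the GMDH approach presented in Section 8.2 lead to the formation of a neural
network with many inputs and one output. The constructed structure approximates the dependence
$y(k) = f(u_1(k), \dots, u_{n_u}(k))$. However, systems with many inputs and many outputs $y_1(k), \dots, y_R(k)$ are
the ones most often found in practical applications. The synthesis of such a model can be realized
similarly to the case of multi-input and single-output models. In the first step of network synthesis, based
on all the combinations of inputs, the system output $\hat{y}_1^{(1)}(k)$ is obtained. Next, based on the same
combinations of inputs, the remaining outputs $\hat{y}_2^{(1)}(k), \dots, \hat{y}_R^{(1)}(k)$ are obtained:


$$\begin{aligned}
\hat{y}_{1,1}^{(1)}(k) &= f\bigl(u_1^{(1)}(k), u_2^{(1)}(k)\bigr) \\
&\;\;\vdots \\
\hat{y}_{1,N}^{(1)}(k) &= f\bigl(u_{n_u-1}^{(1)}(k), u_{n_u}^{(1)}(k)\bigr) \\
&\;\;\vdots \\
\hat{y}_{R,1}^{(1)}(k) &= f\bigl(u_1^{(1)}(k), u_2^{(1)}(k)\bigr) \\
&\;\;\vdots \\
\hat{y}_{R,N}^{(1)}(k) &= f\bigl(u_{n_u-1}^{(1)}(k), u_{n_u}^{(1)}(k)\bigr)
\end{aligned} \tag{8.22}$$


The selection of the best performing neurons in the layer, in terms of their processing accuracy, is realized
with the selection methods described in Section 8.2. The independent evaluation of each of the processing
errors $Q_1, Q_2, \dots, Q_R$ is performed after the generation of each layer of neurons:


$$\begin{matrix}
Q_1\bigl(\hat{y}_{1,1}^{(l)}(k)\bigr), & \dots, & Q_R\bigl(\hat{y}_{1,1}^{(l)}(k)\bigr) \\
& \vdots & \\
Q_1\bigl(\hat{y}_{1,N}^{(l)}(k)\bigr), & \dots, & Q_R\bigl(\hat{y}_{1,N}^{(l)}(k)\bigr) \\
& \vdots & \\
Q_1\bigl(\hat{y}_{R,1}^{(l)}(k)\bigr), & \dots, & Q_R\bigl(\hat{y}_{R,1}^{(l)}(k)\bigr) \\
& \vdots & \\
Q_1\bigl(\hat{y}_{R,N}^{(l)}(k)\bigr), & \dots, & Q_R\bigl(\hat{y}_{R,N}^{(l)}(k)\bigr)
\end{matrix} \tag{8.23}$$


According to the chosen selection method, elements that introduce too large a processing error for each
output $y_1(k), \dots, y_R(k)$ are removed (Figure 8.6). The effectiveness of a neuron in the processing of at least
one output signal is sufficient to leave the neuron in the network.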
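The retention rule can be sketched as follows. This is an illustration only: it assumes that "effective" means ranking among the `keep` best neurons for a given output, which is one possible selection method and not necessarily the one used by the authors; the names `select_neurons`, `errors`, and `keep` are hypothetical.

```python
def select_neurons(errors, keep):
    """Return indices of neurons retained in a multi-output layer.

    errors[n][r] is the processing error Q_r of neuron n for output r.
    A neuron is retained if it is among the `keep` best for at least one output.
    """
    n_outputs = len(errors[0])
    retained = set()
    for r in range(n_outputs):
        ranked = sorted(range(len(errors)), key=lambda n: errors[n][r])
        retained.update(ranked[:keep])
    return sorted(retained)


errors = [
    [0.1, 0.9],  # neuron 0: good for output 1 only
    [0.5, 0.5],  # neuron 1: mediocre for both outputs
    [0.9, 0.1],  # neuron 2: good for output 2 only
    [0.8, 0.8],  # neuron 3: poor for both outputs
]
print(select_neurons(errors, keep=1))  # [0, 2]
```

Neurons 0 and 2 survive although each performs poorly on one of the outputs, because being effective for a single output is sufficient.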

Based on all the selected neurons, a new layer is created. In an analogous way, new layers of the
network are introduced. During the synthesis of the next layers, all outputs from the previous layer must be
used to generate each output $y_1(k), \dots, y_R(k)$. This follows from the fact that in real industrial systems,
outputs are usually correlated, so the output $\hat{y}_{r,n}^{(l)}(k)$ should be obtained based on all the potential outputs
$\hat{y}_{1,1}^{(l-1)}(k), \dots, \hat{y}_{R,N}^{(l-1)}(k)$.


FIGURE 8.6 Selection in the MIMO GMDH network.


In the presented solution, the termination of the synthesis of the GMDH network is determined
independently for each of the partial processing errors $Q_1, Q_2, \dots, Q_R$ (Figure 8.7):


$$Q_{r,\min}^{(l)} = \min_{n=1,\dots,N} Q_r\bigl(\hat{y}_{r,n}^{(l)}(k)\bigr) \quad \text{for } r = 1, \dots, R. \tag{8.24}$$


The synthesis of the network is completed when each of the calculated criterion values reaches its
minimum:


$$Q_{r,\mathrm{opt}} = \min_{l} Q_{r,\min}^{(l)} \quad \text{for } r = 1, \dots, R. \tag{8.25}$$


The output $y_r(k)$ is connected to the output of the neuron for which $Q_{r,\min}^{(l)}$ achieves the least value. The
particular minima can occur at different stages of network synthesis. This is why, in the multi-output
network, the outputs of the resulting structure are usually in different layers (Figure 8.8).


FIGURE 8.7 Termination of the synthesis of the multi-output GMDH network.

FIGURE 8.8 Final structure of the MIMO GMDH neural network.


8.4 Application of GMDH Neural Networks 


The main objective of this section is to present the application of the GMDH neural model to robust
fault detection. A fault can be generally defined as an unexpected change in a system of interest.
Model-based fault diagnosis can be defined as the detection, isolation, and identification of faults in the
system based on a comparison of the available system measurements with the information represented by
the system's mathematical model [15,22].


8.4.1 Robust GMDH Model-Based Fault Detection

The comparison of the system response $y(k)$ and the model response $\hat{y}(k)$ leads to the generation of the
residual:

$$\varepsilon(k) = y(k) - \hat{y}(k), \tag{8.26}$$

which is a source of information about faults for further processing. In the model-based fault detection
approach, it is assumed that $\varepsilon(k)$ should be close to zero in the fault-free case, and it should be
distinguishably different from zero in the case of a fault. Under such an assumption, the faults are detected by
setting a fixed threshold on the residual. In this case, the fault can be detected when the absolute value
of the residual $|\varepsilon(k)|$ is larger than an arbitrarily assumed threshold value $\delta_y$. The difficulty with this
kind of residual evaluation is that the measurement of the system output $y(k)$ is usually corrupted by
noise and disturbances $\varepsilon^{m}(k) \le \varepsilon(k) \le \varepsilon^{M}(k)$, where $\varepsilon^{m}(k) \le 0$ and $\varepsilon^{M}(k) \ge 0$. Another difficulty follows
from the fact that the model obtained during system identification is usually uncertain [19,29]. Model
uncertainty can appear during model structure selection and also during parameter estimation. In practice,
due to modeling uncertainty and measurement noise, it is necessary to assign wider thresholds in order
to avoid false alarms, which can imply a reduction of fault detection sensitivity.


FIGURE 8.9 Robust fault detection with the adaptive time-variant threshold.

To tackle this problem, an adaptive time-variant threshold that is adjusted according to system
behavior can be applied. Indeed, knowing the model structure and possessing knowledge regarding
its uncertainty, it is possible to design a robust fault detection scheme. The idea behind the proposed
approach is illustrated in Figure 8.9.

The proposed technique relies on the calculation of the model output uncertainty interval based on
the estimated parameters whose values are known at some confidence level:


$$\hat{y}^{m}(k) \le \hat{y}(k) \le \hat{y}^{M}(k). \tag{8.27}$$


Additionally, as the measurement of the controlled system response $y(k)$ is corrupted by noise, it is
necessary to add the boundary values of the output error $\varepsilon^{m}(k)$ and $\varepsilon^{M}(k)$ to the model output uncertainty
interval. A system output interval defined in this way should contain the real system response in the
fault-free mode. The occurrence of a fault is signaled when the system output $y(k)$ crosses the system
output uncertainty interval:


$$\hat{y}^{m}(k) + \varepsilon^{m}(k) \le y(k) \le \hat{y}^{M}(k) + \varepsilon^{M}(k). \tag{8.28}$$
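The decision rule based on (8.28) can be sketched in a few lines. This is a minimal illustration, assuming that the interval bounds and the noise bounds are already available as lists of equal length; the function name `detect_faults` and the argument names are hypothetical.

```python
def detect_faults(y, y_hat_min, y_hat_max, eps_min, eps_max):
    """Return one flag per sample, True where y(k) lies outside the interval (8.28)."""
    alarms = []
    for k in range(len(y)):
        lower = y_hat_min[k] + eps_min[k]  # lower bound of the system output interval
        upper = y_hat_max[k] + eps_max[k]  # upper bound of the system output interval
        alarms.append(not (lower <= y[k] <= upper))
    return alarms


alarms = detect_faults(
    y=[1.0, 1.25, 3.0],
    y_hat_min=[0.8, 0.8, 0.8],
    y_hat_max=[1.2, 1.2, 1.2],
    eps_min=[-0.1, -0.1, -0.1],
    eps_max=[0.1, 0.1, 0.1],
)
print(alarms)  # [False, False, True]
```

The second sample exceeds the model output interval but remains within the noise-widened interval, so no alarm is raised. Only the third sample is flagged.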


For the robust fault detection method to be effective, a mathematical description of the model
uncertainty must be determined, and the maximal and minimal values of the disturbances $\varepsilon$ must be known.

To solve this problem, the GMDH model can be applied, which is constructed according to the
procedure described in Section 8.2.1. At the beginning, it is necessary to adapt the OBE algorithm to
parameter estimation of the partial models (8.8) with the nonlinear activation function $\xi(\cdot)$. In order to avoid
the noise additivity problem, it is necessary to transform the relation


$$\varepsilon^{m}(k) \le y(k) - \xi\Bigl(\bigl(r_n^{(l)}(k)\bigr)^T p_n^{(l)}\Bigr) \le \varepsilon^{M}(k) \tag{8.29}$$


using $\xi^{-1}(\cdot)$, and hence


$$\xi^{-1}\bigl(y(k) - \varepsilon^{M}(k)\bigr) \le \bigl(r_n^{(l)}(k)\bigr)^T p_n^{(l)} \le \xi^{-1}\bigl(y(k) - \varepsilon^{m}(k)\bigr). \tag{8.30}$$


FIGURE 8.10 Relation between the size of the ellipsoid and model output uncertainty.


The transformation (8.30) is appropriate if the conditions concerning the properties of the nonlinear
activation function $\xi(\cdot)$ are fulfilled. The methodology described in Section 8.2.2 makes it possible to
obtain $\hat{p}$ and $\mathbb{E}$. However, from the point of view of applications of the GMDH model to fault detection, it
is important to obtain the model output uncertainty interval. The range of this interval for the partial
model output depends on the size and the orientation of the ellipsoid $\mathbb{E}$ (cf. Figure 8.10).
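The transformation (8.30) can be illustrated for one concrete activation function. The sketch below assumes the hyperbolic tangent, which is continuous, bounded, monotonically increasing, and invertible, so it meets the conditions listed in Section 8.2.2; the text itself does not prescribe a particular function, and the name `linear_part_bounds` is hypothetical.

```python
import math


def linear_part_bounds(y, eps_min, eps_max):
    """Bounds on r^T p implied by eps_min <= y - tanh(r^T p) <= eps_max."""
    lower_arg = y - eps_max
    upper_arg = y - eps_min
    # atanh is defined only on the open interval (-1, 1), the range of tanh
    if not (-1.0 < lower_arg and upper_arg < 1.0):
        raise ValueError("bounds fall outside the range of the activation function")
    return math.atanh(lower_arg), math.atanh(upper_arg)


low, high = linear_part_bounds(y=0.5, eps_min=-0.1, eps_max=0.1)
```

After this step the constraint is linear in the parameters, so the bounded-error estimation of Section 8.2.2 can be applied to the transformed data. Note that the larger error bound produces the lower limit, because the activation function is increasing.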

Taking the minimal and maximal values of the admissible parameter set $\mathbb{E}$ into consideration, it is
possible to determine the minimal and maximal values of the model output uncertainty interval for
each partial model of the GMDH neural network:


$$r^T(k)\hat{p} - \sqrt{r^T(k) P r(k)} \le r^T(k) p \le r^T(k)\hat{p} + \sqrt{r^T(k) P r(k)}. \tag{8.31}$$
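The interval (8.31) can be computed directly from the ellipsoid center and shape matrix. A minimal sketch, assuming the regressor, the center, and the matrix are given as plain Python lists; the function name `output_interval` is hypothetical.

```python
import math


def output_interval(r, p_hat, P):
    """Return (lower, upper) bounds of r^T p over the ellipsoid with center p_hat and shape P."""
    n = len(r)
    nominal = sum(r[i] * p_hat[i] for i in range(n))               # r^T p_hat
    quad = sum(r[i] * P[i][j] * r[j] for i in range(n) for j in range(n))  # r^T P r
    half_width = math.sqrt(quad)
    return nominal - half_width, nominal + half_width


print(output_interval(r=[1.0, 2.0], p_hat=[0.5, 0.25], P=[[9.0, 0.0], [0.0, 4.0]]))
# prints (-4.0, 6.0)
```

The half-width grows with the size of the ellipsoid in the direction of the regressor, which is why both the size and the orientation of the ellipsoid matter.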


The partial models in the $l$th ($l > 1$) layer of the GMDH neural network are based on outputs incoming
from the $(l-1)$th layer. Since (8.31) describes the model output uncertainty interval in the $(l-1)$th
layer, parameters of the partial models in the next layers have to be obtained with an approach that solves
the problem of an uncertain regressor. Let us denote the unknown "true" value of the regressor by $r_n(k)$ and
express it as the difference between a known (measured) value of the regressor $r(k)$ and the error in the
regressor $e(k)$:


$$r_n(k) = r(k) - e(k), \tag{8.32}$$

where the regressor error $e(k)$ is bounded as follows:


$$-\varepsilon_i \le e_i(k) \le \varepsilon_i, \quad i = 1, \dots, n_p. \tag{8.33}$$


Substituting (8.32) into (8.31), it can be shown that the partial model output uncertainty interval has
the following form:


$$\hat{y}^{m}(k) \le r^T(k) p \le \hat{y}^{M}(k), \tag{8.34}$$

where

$$\hat{y}^{m}(k) = r_n^T(k)\hat{p} + e^T(k)\hat{p} - \sqrt{(r_n(k) + e(k))^T P (r_n(k) + e(k))}, \tag{8.35}$$

$$\hat{y}^{M}(k) = r_n^T(k)\hat{p} + e^T(k)\hat{p} + \sqrt{(r_n(k) + e(k))^T P (r_n(k) + e(k))}. \tag{8.36}$$


In order to obtain the final form of the expression (8.34), it is necessary to take into consideration the
bounds of the regressor error (8.33) in the expressions (8.35) and (8.36). Under (8.33), the term $e^T(k)\hat{p}$
is smallest when each $e_i(k)$ takes the bound whose sign is opposite to that of $\hat{p}_i$ and largest when the
signs agree, which gives


$$\hat{y}^{m}(k) = r_n^T(k)\hat{p} - \sum_{i=1}^{n_p} \mathrm{sgn}(\hat{p}_i)\hat{p}_i\varepsilon_i - \sqrt{\bar{r}_n^T(k) P \bar{r}_n(k)}, \tag{8.37}$$

$$\hat{y}^{M}(k) = r_n^T(k)\hat{p} + \sum_{i=1}^{n_p} \mathrm{sgn}(\hat{p}_i)\hat{p}_i\varepsilon_i + \sqrt{\bar{r}_n^T(k) P \bar{r}_n(k)}, \tag{8.38}$$


where

$$\bar{r}_{n,i}(k) = r_{n,i}(k) + \mathrm{sgn}(r_{n,i}(k))\varepsilon_i. \tag{8.39}$$


The GMDH model output uncertainty interval (8.37) and (8.38) should contain the real system response in
the fault-free mode. As the measurements of the system response are corrupted by noise, it is necessary to
add the boundary values of the output error (8.10) to the model output uncertainty interval (8.37) and (8.38).
The newly defined interval (Figure 8.11) is called the system output uncertainty interval, and it is calculated
for the partial model in the last GMDH neural network layer, which generates the model output. The
occurrence of a fault is signaled when the system output signal crosses the system output uncertainty interval.


FIGURE 8.11 Fault detection via the system output uncertainty interval.


8.4.2 Robust Fault Detection of the Intelligent Actuator 


In order to show the effectiveness of the GMDH model-based fault detection system, the actuator model
from the Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems
(DAMADICS) benchmark [2] was employed (Figure 8.12), where $V_1$, $V_2$, and $V_3$ denote the bypass valves,
and ACQ and CPU are the data acquisition and positioner central processing units, respectively. E/P and FT
are the electro-pneumatic and valve flow transducers. Finally, DT and PT represent the displacement
and the pressure transducers. On the basis of process analysis and taking into account the expert
process knowledge, the following models of the juice flow at the outlet of the valve, $F = r_F(X, P_1, P_2, T_1)$, and
the servomotor rod displacement, $X = r_X(C_V, P_1, P_2, T_1)$, were considered, where $r_F(\cdot)$ and $r_X(\cdot)$ denote the
modeled relationships, $C_V$ is the control value, $P_1$ and $P_2$ are the pressures at the inlet and the outlet of the
valve, respectively, and $T_1$ represents the juice temperature at the inlet of the valve.


FIGURE 8.12 Diagram of the actuator.

The DAMADICS benchmark makes it possible to generate data for 19 different faults. In the benchmark
scenario, abrupt (A) and incipient (I) faults are considered. Furthermore, the abrupt faults can be regarded
as small (S), medium (M), and big (B), according to the benchmark descriptions. The synthesis process of the
GMDH neural network proceeds according to the steps described in Section 8.2.1. During the synthesis
of the GMDH networks, dynamic neurons with the IIR filter (8.19) were applied. The selection of the best
performing neurons in each layer of the GMDH network in terms of their processing accuracy was realized
with the soft selection method based on the following evaluation criterion:


$$Q_V = \frac{1}{n_V} \sum_{k=1}^{n_V} \Bigl| \bigl(\hat{y}^{M}(k) + \varepsilon^{M}(k)\bigr) - \bigl(\hat{y}^{m}(k) + \varepsilon^{m}(k)\bigr) \Bigr|. \tag{8.40}$$


The values of this criterion were calculated separately for each neuron in the GMDH network, whereas
$\hat{y}^{m}(k)$ and $\hat{y}^{M}(k)$ in (8.40) were obtained with (8.31) for the neurons in the first layer of the network and
with (8.34) for the subsequent ones. Table 8.1 presents the results for the subsequent layers, that is, the
values obtained for the best performing partial models in a particular layer.

The results show that the value of the evaluation criterion decreases gradually as new layers of the
GMDH network are introduced. This follows from the increase in model complexity as well
as in its modeling abilities. However, when the model is too complex, the quality index $Q_V$ increases. This
situation occurs when the fifth layer of the network is added. It means that the models corresponding to
$F = r_F(\cdot)$ and $X = r_X(\cdot)$ should have only four layers. The final structures of the GMDH neural networks are
presented in Figures 8.13 and 8.14.
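The choice of four layers can be reproduced from the values reported in Table 8.1 by taking, for each model, the layer with the smallest criterion value, in the spirit of the stopping rule (8.7). The helper name `best_layer` is hypothetical; the numbers are those of Table 8.1.

```python
def best_layer(q_per_layer):
    """Return (layer number, criterion value) of the minimum; layers are numbered from 1."""
    index = min(range(len(q_per_layer)), key=lambda l: q_per_layer[l])
    return index + 1, q_per_layer[index]


q_flow = [1.1034, 1.0633, 1.0206, 0.9434, 1.9938]          # row for the flow model in Table 8.1
q_displacement = [0.3198, 0.2931, 0.2895, 0.2811, 0.2972]  # row for the displacement model

print(best_layer(q_flow))          # (4, 0.9434)
print(best_layer(q_displacement))  # (4, 0.2811)
```

In an online synthesis the minimum is not known in advance, so the fifth layer has to be generated and evaluated before it can be discarded.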

After the synthesis of the $F = r_F(\cdot)$ and $X = r_X(\cdot)$ GMDH models, it is possible to employ them for
robust fault detection. This task can be realized with the application of the system output interval
defined in Section 8.4.1. Figures 8.15 through 8.18 present the system responses and the
corresponding system output uncertainty intervals for the faulty data, where $t_f$ denotes the moment of
fault occurrence.


TABLE 8.1 Evolution of $Q_V$ for the Subsequent Layers

Layer           1        2        3        4        5
$r_F(\cdot)$    1.1034   1.0633   1.0206   0.9434   1.9938
$r_X(\cdot)$    0.3198   0.2931   0.2895   0.2811   0.2972


FIGURE 8.13 Final structure of $F = r_F(\cdot)$.

FIGURE 8.14 Final structure of $X = r_X(\cdot)$.

FIGURE 8.15 System response and the system output uncertainty interval for the big abrupt fault f,.

Table 8.2 shows the results of fault detection for all the faults considered. The notation given in
Table 8.2 can be explained as follows: ND means that it is impossible to detect a given fault, $D_F$ or $D_X$
means that it is possible to detect a fault with $r_F(\cdot)$ or $r_X(\cdot)$, respectively, while $D_{FX}$ means that a given fault
can be detected with both $r_F(\cdot)$ and $r_X(\cdot)$. From the results presented in Table 8.2, it can be seen that it is
impossible to detect the faults f;, f,, and f,,. Moreover, some small and medium faults cannot be detected, 
that is, f, and f,,. This situation can be explained by the fact that the effect of these faults is at the same 
level as the effect of noise.


FIGURE 8.16 System response and the system output uncertainty interval for the incipient fault f,.

FIGURE 8.17 System response and the system output uncertainty interval for the incipient fault f,.

FIGURE 8.18 System response and the system output uncertainty interval for the abrupt medium fault f.


TABLE 8.2 Results of Fault Detection

Fault      Description
$f_1$      Valve clogging
$f_2$      Valve plug or valve seat sedimentation
$f_3$      Valve plug or valve seat erosion
$f_4$      Increase of valve or bushing friction
$f_5$      External leakage
$f_6$      Internal leakage (valve tightness)
$f_7$      Medium evaporation or critical flow
$f_8$      Twisted servomotor's piston rod
$f_9$      Servomotor's housing or terminals tightness
$f_{10}$   Servomotor's diaphragm perforation
$f_{11}$   Servomotor's spring fault
$f_{12}$   Electro-pneumatic transducer fault
$f_{13}$   Rod displacement sensor fault
$f_{14}$   Pressure sensor fault
$f_{15}$   Positioner feedback fault
$f_{16}$   Positioner supply pressure drop
$f_{17}$   Unexpected pressure change across the valve
$f_{18}$   Fully or partly opened bypass valves
$f_{19}$   Flow rate sensor fault

Note: S, small; M, medium; B, big; I, incipient.


8.5 Conclusions


One of the crucial problems occurring during system identification with the application of ANNs is
the choice of an appropriate neural model architecture. The GMDH approach solves this problem and
allows us to choose such an architecture directly, only on the basis of measurement data. The application of
this approach leads to the evolution of the resulting neural model architecture in such a way as to obtain
the best quality approximation of the identified system. It is worth emphasizing that the original GMDH
algorithm can be easily extended and applied to the identification of dynamical SISO and MIMO systems.

Unfortunately, irrespective of the identification method used, there is always the problem of model
uncertainty, that is, the model-reality mismatch. Even though the application of the GMDH approach
to model structure selection can improve the quality of the model, the resulting structure is not
the same as that of the system. The application of the OBE algorithm to parameter estimation of the
GMDH model also allows us to obtain the neural model uncertainty.

In the illustrative part of the chapter, an example concerning a practical application of the GMDH neural
model was presented. The calculation of the GMDH model uncertainty in the form of the system output
uncertainty interval made it possible to perform robust fault detection of industrial systems. In particular, the
proposed approach was tested on fault detection of the intelligent actuator connected with an
evaporation station. The obtained results show that almost all faults can be detected, except for a few incipient
or small ones.


The construction process of an artificial neural network (ANN), which is required to solve a given
problem, usually consists of four steps [40]. In the first step, a set of pairs of input and output patterns,
which should represent the characteristics of the problem as well as possible, is selected. In the next step, the
architecture of the ANN, the number of units, their ordering into layers or modules, the synaptic
connections, and other structure parameters are defined. In the third step, the free parameters of the ANN
(e.g., weights of synaptic connections, slope parameters of activation functions) are automatically trained
using a set of training patterns (a learning process). Finally, the obtained ANN is evaluated in
accordance with a given quality measure. The process is repeated until the quality measure of the ANN is
satisfied. Many effective methods of ANN training and quality estimation have appeared in recent years.
However, researchers usually choose an ANN architecture and select a representative set of input
and output patterns based on their intuition and experience rather than on an automatic procedure. In
this chapter, algorithms of automatic ANN structure selection are presented and analyzed.

The chapter is organized in the following way: First, the problem of ANN architecture selection is
defined. Next, some theoretical aspects of this problem are considered. Chosen methods of automatic
ANN structure selection are presented in a systematic way in the following sections.


9.1 Problem Statement 


Before the problem of ANN architecture optimization is formulated, some useful definitions will be
introduced. The ANN is represented by an ordered pair $NN = (NA, v)$ [7,40,41]. $NA$ denotes the ANN
architecture:


$$NA = (\{V_i \mid i = 0, \dots, M\}, \mathcal{E}). \tag{9.1}$$


$\{V_i \mid i = 0, \dots, M\}$ is a family of $M + 1$ sets of neurons, called layers, including at least two nonempty sets $V_0$
and $V_M$ that define $s_0 = \mathrm{card}(V_0)$ input and $s_M = \mathrm{card}(V_M)$ output units, respectively, and $\mathcal{E}$ is a set of
connections between neurons in the network. The vector $v$ contains all free parameters of the network, among
which there is the set of weights of synaptic connections $w: \mathcal{E} \to \mathbb{R}$. In general, the sets $\{V_i \mid i = 0, \dots, M\}$ do
not have to be disjoint, thus there can be input units that are also outputs of the NN. Units that do not
belong to either $V_0$ or $V_M$ are called hidden neurons. If there are cycles of synaptic connections in the
set $\mathcal{E}$, then we have a dynamic network.

The most popular type of neural network is the feed-forward neural network, also called a multilayer
perceptron (MLP), whose architecture possesses the following properties:


$$\forall i \ne j: \; V_i \cap V_j = \emptyset, \tag{9.2}$$

$$\mathcal{E} = \bigcup_{i=0}^{M-1} V_i \times V_{i+1}. \tag{9.3}$$


Layers in the MLP are disjoint. The main task of the input units of the layer $V_0$ is preliminary processing
of the input data $u = \{u_p \mid p = 1, 2, \dots, P\}$ and passing them to the units of the hidden layer. Data processing
can comprise, for example, scaling, filtering, or signal normalization. Fundamental neural data processing
is carried out in the hidden and output layers. It should be noted that links between neurons are
designed in such a way that each element of the previous layer is connected with each element of the next
layer. There are no feedback connections. Connections are assigned suitable weight coefficients, which
are determined, for each separate case, depending on the task the network should solve.
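The connection set (9.3) can be generated mechanically from the list of layers. A minimal sketch, assuming units are identified by strings and the layers are given in order from $V_0$ to $V_M$; the function name `mlp_connections` is hypothetical.

```python
from itertools import product


def mlp_connections(layers):
    """Return the set of connections between consecutive layers, as in (9.3)."""
    edges = set()
    for current_layer, next_layer in zip(layers, layers[1:]):
        # every unit of one layer is connected with every unit of the next layer
        edges.update(product(current_layer, next_layer))
    return edges


layers = [["u1", "u2"], ["h1", "h2", "h3"], ["y1"]]
edges = mlp_connections(layers)
print(len(edges))  # 9 connections: 2 * 3 + 3 * 1
```

Since only consecutive layers are paired, the set contains no cycles and no connections that skip a layer, which corresponds to the absence of feedback in the MLP.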

In this chapter, for simplicity of presentation, the attention will be focused on the methods of MLP 
architecture optimization based on neural units with monotonic activation functions. The wide class of 
problems connected with the RBF, recurrent, or cellular networks will be passed over. 

The fundamental learning algorithm for the MLP is the backpropagation (BP) algorithm [48,58]. This
algorithm is iterative and is based on the minimization of a sum-squared error with the
gradient-descent optimization method. Although the standard BP algorithm converges slowly, it is widely
used, and in recent years numerous modifications and extensions of it have been proposed. Currently, the
Levenberg-Marquardt algorithm [16] seems to be most often applied by researchers.

Neural networks with the MLP architecture owe their popularity to many effective applications, for
example, in pattern recognition problems [30,53] and the approximation of nonlinear functions [22].
It has been proved that using the MLP with only one hidden layer and a suitable number of neurons,
it is possible to approximate any nonlinear static relation with arbitrary accuracy [6,22]. Thus, taking
the relatively simple algorithms applied to MLP learning into consideration, this type of network becomes a
very attractive tool for building models of static systems.

Let us consider the network that has to approximate a given function $f(u)$. Let $\Phi = \{(u, y)\}$ be
the set of all possible (usually uncountable) pairs of vectors from the domain $u \in D \subset R^n$ and
from the range $y \in D' \subset R^m$, which realize the relation $y = f(u)$. The goal is to construct an
NN with an architecture $\mathrm{NA}^{opt}$ and a set of parameters $v^{opt}$, which fulfills the relation
$y_{\mathrm{NA},v} = f_{\mathrm{NA},v}(u)$, such that a given cost function
$\sup_{(u,y)\in\Phi} J_T(y_{\mathrm{NA},v}, y)$ is minimized. So, the following pair has to be found:

$$(\mathrm{NA}^{opt}, v^{opt}) = \arg\min \left[ \sup_{(u,y)\in\Phi} J_T(y_{\mathrm{NA},v}, y) \right]. \tag{9.4}$$

In practice, this problem cannot be solved because of the infinite cardinality of the set $\Phi$.
Thus, in order to estimate the solution, two finite sets $\Phi_L, \Phi_T \subset \Phi$ are selected.
The set $\Phi_L$ is used in the learning process of an ANN of a given architecture NA:

$$v^* = \arg\min_{v \in V} \left[ \max_{(u,y)\in\Phi_L} J_L(y_{\mathrm{NA},v}, y) \right], \tag{9.5}$$

where $V$ is the space of network parameters. In general, the cost functions of the learning,
$J_L(y_{\mathrm{NA},v}, y)$, and testing, $J_T(y_{\mathrm{NA},v}, y)$, processes can have different
definitions. The set $\Phi_T$ is used in the search for the architecture $\mathrm{NA}^*$, for which

$$\mathrm{NA}^* = \arg\min_{\mathrm{NA} \in A} \left[ \max_{(u,y)\in\Phi_T} J_T(y_{\mathrm{NA},v^*}, y) \right], \tag{9.6}$$


where A is the space of neural-network architectures. Obviously, the solutions of both tasks (9.5) and 
(9.6) need not necessarily be unique. Then, a definition of an additional criterion is needed. 
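
The two nested tasks (9.5) and (9.6) can be illustrated with a minimal sketch. It is not the author's procedure: the "architecture" here is only the number of bins of a piecewise-constant approximator, the squared error stands in for both $J_L$ and $J_T$, and the names `worst_case_cost`, `train_piecewise_constant`, and `select_architecture` are hypothetical. Setting each bin level to the midrange of its learning targets minimizes the maximal error in that bin, which plays the role of the inner problem (9.5).

```python
from typing import Callable, List, Sequence, Tuple

Pattern = Tuple[float, float]  # (u, y) pair; scalar input and output for brevity


def worst_case_cost(model: Callable[[float], float], patterns: Sequence[Pattern]) -> float:
    """Maximal squared error over a finite pattern set, as in the max of (9.5)/(9.6)."""
    return max((model(u) - y) ** 2 for u, y in patterns)


def train_piecewise_constant(n_bins: int, learn_set: Sequence[Pattern]) -> Callable[[float], float]:
    """Toy 'learning' for inputs in [0, 1]: one constant level per bin.

    The midrange of the targets in a bin minimizes the maximal squared error there.
    """
    bins: List[List[float]] = [[] for _ in range(n_bins)]
    for u, y in learn_set:
        bins[min(int(u * n_bins), n_bins - 1)].append(y)
    levels = [(min(b) + max(b)) / 2 if b else 0.0 for b in bins]
    return lambda u: levels[min(int(u * n_bins), n_bins - 1)]


def select_architecture(architectures, train, learn_set, test_set):
    """Outer search (9.6): train each candidate on Phi_L, compare on Phi_T."""
    best = None
    for na in architectures:
        model = train(na, learn_set)              # inner problem (9.5)
        cost = worst_case_cost(model, test_set)   # criterion of (9.6)
        if best is None or cost < best[1]:
            best = (na, cost)
    return best


# Disjoint learning and testing sets sampled from y = u^2 on [0, 1].
learn_set = [(k / 10, (k / 10) ** 2) for k in range(11)]
test_set = [((k + 0.5) / 10, ((k + 0.5) / 10) ** 2) for k in range(10)]
best_na, best_cost = select_architecture([1, 2, 5], train_piecewise_constant, learn_set, test_set)
```

The outer loop only ever sees the trained model, so a poor inner optimum directly distorts the architecture ranking, which is the dependence on the learning algorithm discussed later in this section.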

There are many criteria for selecting the best neural-network architecture. The most popular
ones are [41]


• Minimization of the number of network free parameters. In this case, the subset

$$A_\delta = \{\mathrm{NA} : J_T(y_{\mathrm{NA},v^*}, y) \le \delta\} \subset A \tag{9.7}$$

is looked for. The network with the architecture $\mathrm{NA} \in A_\delta$ and the smallest number of
training parameters is considered to be optimal. This criterion is crucial when VLSI implementation
of the neural network is planned.

• Maximization of the network generalization ability. The sets of training $\Phi_L$ and testing $\Phi_T$
patterns have to be disjoint, $\Phi_L \cap \Phi_T = \emptyset$. Then, $J_T$ is the conformity measure
between the network reply on testing patterns and the desired outputs. Usually, both quality measures
$J_L$ and $J_T$ are similarly defined:

$$J_{L(T)}(y_{\mathrm{NA},v}, y) = \sum_{k=1}^{\mathrm{card}(\Phi_{L(T)})} \left( y_{\mathrm{NA},v}(u_k) - y_k \right)^2. \tag{9.8}$$

The restriction of the number of training parameters is the minor criterion in this case. The above
criterion is important for approximating networks or neural models.

• Maximization of the noise immunity. This criterion is used in networks applied in classification
or pattern recognition problems. The quality measure is the maximal noise level of the pattern
that is still recognized by the network.


The first two criteria are correlated. Gradually decreasing the number of hidden neurons and
synaptic connections lowers the nonlinearity level of the network mapping, and the network
generalization ability then increases. The third criterion requires some redundancy of the network
parameters, which usually conflicts with the previous criteria. In most publications, the second
criterion is chosen.

The quality of the estimates obtained with neural networks strongly depends on the selection of the
finite learning $\Phi_L$ and testing $\Phi_T$ sets. Small network structures may not be able to
approximate the desired relation between inputs and outputs with a satisfying accuracy. On the other
hand, if the number of network free parameters is too large (in comparison with
$\mathrm{card}(\Phi_L)$), then the function $f_{\mathrm{NA},v}(u)$ realized by the network strongly
depends on the actual set of learning patterns (the bias/variance dilemma [12]).

It is very important to note that the efficiency of any method of neural-network architecture
optimization strongly depends on the learning algorithm used. If the network error function is
multimodal, the effectiveness of the classical learning algorithms based on the gradient-descent
method (e.g., the BP algorithm and its modifications) is limited. These methods usually settle in
some local optimum, and the superior algorithm searching for the optimal architecture then receives
wrong information about the quality of the trained network.

9.2 MLP as a Canonical Form Approximator of Nonlinearity


Let us consider the MLP network with two hidden layers and units with a sigmoid activation function
(Figure 9.1). Four spaces can be distinguished: the input space $U$ and its successive patterns
$Y_{h1} = R_{h1}(U)$, $Y_{h2} = R_{h2}(Y_{h1})$, and $Y = R_0(Y_{h2})$, where $R_{h1}$, $R_{h2}$, and
$R_0$ are the mappings realized by the two hidden layers and the output layer, respectively. The numbers
of input and output units are defined by the dimensions of the input and output spaces. The number of
hidden units in both hidden layers depends on the approximation problem solved by the network. Further
deliberations are based on the following theorem [56]:


Theorem 9.2.1 Let $\Phi_L$ be a finite set of training pairs associated with finite and compact
manifolds. Let $f$ be some continuous function. Taking into account the space of three-level MLPs, there
exists an unambiguous approximation of the canonical decomposition of the function $f$, if and only if
the number of hidden neurons in each hidden layer is equal to the dimension of the subspace of the
canonical decomposition of the function $f$.


Theorem 9.2.1 gives necessary and sufficient conditions for the existence of an MLP approximation of
the canonical decomposition of any continuous function. These conditions are as follows: $U$ and $Y$
must be fully represented by the learning set $\Phi_L$. The network contains two hidden layers, which
are enough for implementing the discussed approximation of the canonical decomposition of any
continuous function. The goal of the first hidden layer is to map the n-dimensional input space $U$ into
the space $Y_{h1} = R_{h1}(U)$, which is an inverse image of the output space in the sense of the
function $f$. Thus, the mapping $Y_{h1} \to Y$ is invertible. The number of units in the first hidden
layer, $\mathrm{card}(V_1)$, is equal to the dimension of the minimal space that still fully represents
the input data and is, in general, lower than the dimension of the input vectors.

Theorem 9.2.1 guarantees that an approximation of the canonical form of the function $f$ exists and
is unambiguous. If $\mathrm{card}(V_1)$ is higher than the dimension of the canonical decomposition space
of the function $f$, the network does not approximate the canonical decomposition but can still be the
best approximation of the function $f$. However, such an approximation is not unambiguous and depends on
the initial condition of the learning process. On the other hand, if the number $\mathrm{card}(V_1)$ is
too low, the obtained approximation is not optimal. So, both the deficiency and excess of neurons in the
first hidden layer lead to poor approximation.

As has been pointed out above, the first layer reduces the dimension of the actual input space to the
level sufficient for optimal approximation. The next two layers, the second hidden one and the output
one, are sufficient for the realization of such an approximation [6,22]. The number of units in the
second hidden layer, $\mathrm{card}(V_2)$, is determined by an assumed error of approximation. A lower
error requires a higher $\mathrm{card}(V_2)$. The crucial tradeoff that one has to make is between the
learning capability of the MLP and fluctuations due to the finite sample size. If $\mathrm{card}(V_2)$
is too small, the network might not be able to approximate the functional relationship between the input
and target output well enough. If $\mathrm{card}(V_2)$ is too great (compared to the number of learning
samples), the realized network function will depend too much on the actual realization of the learning
set [12].

FIGURE 9.1 MLP network with three layers.

The above consideration suggests that the MLP can be used for the approximation of the canonical 
decomposition of any function specified on the compact topological manifold. The following question 
comes to mind: Why is the canonical decomposition needed? Usually, essential variables, which fully 
describe the input-output relation, are not precisely defined. Thus, the approximation of this relation 
can be difficult. The existence of the first layer allows us to transform real data to the form of the com- 
plete set of variables of an invertible mapping. If the input space agrees with the inverse image of the 
approximated mapping, the first hidden layer is unnecessary. 


9.3 Methods of MLP Architecture Optimization 
9.3.1 Methods Classification 


Procedures that search for the optimal ANN architecture have been studied for a dozen or so years,
particularly intensively in the period between 1989 and 1991. At that time, almost all standard
solutions were published. In subsequent years, the number of publications decreased significantly.
Most of the proposed methods were dedicated to specific types of neural networks, and new results are
still needed.

The literature on this problem is extensive, and a variety of architecture optimization algorithms
have been proposed. They can be divided into three classes [7,40,41]:


• Bottom-up approaches
• Top-down (pruning) approaches
• Discrete optimization methods


Starting with a relatively small architecture, bottom-up procedures increase the number of hidden units and 
thus increase the power of the growing network. Bottom-up methods [2,9,10,21,33,51,55] prove to be the most 
flexible approach, though computationally expensive (complexity of all known algorithms is exponential). 
Several bottom-up methods have been reported to learn even hard problems with a reasonable computational 
effort. The resulting network architectures can hardly be proven to be optimal. But further criticism concerns 
the insertion of hidden neurons as long as elements of the learning set are misclassified. Thus, the resulting 
networks exhibit a poor generalization performance and are disqualified for many applications. 

Most neural-network applications use the neural model with a binary, bipolar, sigmoid, or hyperbolic
tangent activation function. A single unit of this type represents a hyperplane, which separates its
input space into two subspaces. Through serial-parallel unit connections in the network, the input
space is divided into subspaces that are polyhedral sets. The idea of the top-down methods is a gradual
reduction of the number of hidden units in order to simplify the shapes of the division of the input
space. In this way, the generalization property can be improved. Top-down approaches
[1,4,5,11,18,23,28,29,35,39,45,46,57] inherently assume the knowledge of a sufficiently complex network
architecture, which can always be provided for learning samples of finite size. Because the algorithms
presented up to now can only handle special cases of redundancy reduction in a network architecture,
they are likely to result in a network that is still oversized. In this case, the cascade-reduction
method [39], where the architecture obtained using a given top-down method is the initial architecture
for the next searching process, can be a good solution.

The space of ANN architectures is infinite and discrete, and there are many publications dealing
with the implementation of discrete optimization methods to solve the ANN architecture optimization
problem. In particular, evolutionary algorithms, especially genetic algorithms (GA), seem to have
gained strong interest within this context (cf. [3,15,17,19,25-27,32,34,36,44,50,54]). Nevertheless,
implementations of the A* algorithm [7,38,41,43], simulated annealing [31,40,41], and tabu
search [31,41,43] also deserve attention.


9.3.2 Bottom-Up Approaches 


One of the first bottom-up methods was proposed by Mezard and Nadal [33]. Their tiling algorithm is
dedicated to the MLP that has to map Boolean functions of binary inputs. Creating subsequent layers
neuron by neuron, the tiling algorithm successively reduces the number of learning patterns, which 
are not linearly separable. A similar approach was introduced by Frean [10]. Both algorithms give MLP 
architectures in a finite time, and these architectures aspire to be almost optimal. In [21], an extension 
of the back-propagation (BP) algorithm was proposed. This algorithm allows us to add or reduce hidden 
units depending on the actual position of the training process. Ash [2] and Setiono and Hui [51] stated 
that the training process of sequentially created networks is initiated using values of parameters from 
previously obtained networks. Wang and coworkers [55] built an algorithm based on their Theorem 
9.2.1 [56], which describes necessary and sufficient conditions under which there exists neural-network 
approximation of the canonical decomposition of any continuous function. The cascade-correlation 
algorithm [9] builds an ANN of an original architecture. 


9.3.2.1 Tiling Algorithm 


The tiling algorithm [33] was proposed for feed-forward neural networks with one output and binary 
function activation of all neurons. Using such a network, any Boolean function of n inputs or some 
approximation of such a function (if the number of learning patterns $p < 2^n$) can be realized. The authors
propose a strategy in which neurons are added to the network in the following order (Figure 9.2): The 
first neuron of each layer fulfils a special role and is called a master unit. An output of the master unit 
of the latest added layer is used for the calculation of a recent network quality measure. In the best case, 
the output of the network is faultless and the algorithm run is finished. Otherwise, auxiliary nodes are 
introduced into the last layer until the layer outputs become a “suitable representation” of the problem, 
that is, for two different learning patterns (with different desired outputs) the output vectors of the layer 
are different. If the “suitable representation” is achieved, then the layer construction process is finished 
and a new master unit of a new output layer is introduced and trained.


FIGURE 9.2 Order of node adding to the network in the tiling algorithm.

FIGURE 9.3 Upstart network.


In detail, the layer construction process is as follows: Let the layer $V_{L-1}$ be "suitable"; then the
set of the learning patterns can be divided into classes $\{C^\nu \mid \nu = 1 \ldots p_{L-1}\}$, which
are attributed to different activation vectors (so-called prototypes)
$\{\tau^\nu \mid \nu = 1 \ldots p_{L-1}\}$. Each learning pattern of $C^\nu$ is attributed to the same
desired output $y^\nu$. The master unit of the layer $V_L$ is trained by Rosenblatt's algorithm [47]
using the learning pairs $\{(\tau^\nu, y^\nu) \mid \nu = 1 \ldots p_{L-1}\}$. If the output error is
equal to zero, then the patterns $\{(\tau^\nu, y^\nu) \mid \nu = 1 \ldots p_{L-1}\}$ are linearly
separated and the master unit of $V_L$ is the output unit of the network. If the output error of the
master unit of $V_L$ is not equal to zero, then auxiliary nodes have to be introduced. At this point
there are just two nodes in the layer $V_L$: the bias and the master unit. Thus, the learning patterns
belong to two classes of prototypes: $(\tau_0 = 1, \tau_1 = 0)$ and $(\tau_0 = 1, \tau_1 = 1)$. Because
the output of the master unit of $V_L$ is not faultless, there exists at least one "unsuitable" class
$C^\mu$. A new auxiliary unit is trained on the relationship $\tau^\mu \to y^\mu$, using only patterns
from the "unsuitable" class $C^\mu$. After training, the class $C^\mu$ can be divided into two
"suitable" classes. Otherwise, the "unsuitable" subclass of $C^\mu$ is used to train a new auxiliary
unit. Such a consistent procedure leads to the creation of the "suitable" layer $V_L$. Tiling algorithm
convergence is proved in [33].
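
The "suitable representation" test that drives the addition of auxiliary units can be sketched as follows. This is a minimal illustration rather than the implementation from [33]; the function names `group_by_prototype` and `unsuitable_classes` are hypothetical, and the unit outputs are given directly instead of being produced by trained perceptrons.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prototype = Tuple[int, ...]


def group_by_prototype(layer_outputs: Sequence[Sequence[int]],
                       targets: Sequence[int]) -> Dict[Prototype, List[int]]:
    """Collect the desired outputs of all patterns sharing one layer activation vector."""
    classes: Dict[Prototype, List[int]] = defaultdict(list)
    for prototype, target in zip(layer_outputs, targets):
        classes[tuple(prototype)].append(target)
    return dict(classes)


def unsuitable_classes(classes: Dict[Prototype, List[int]]) -> List[Prototype]:
    """A class is 'unsuitable' when its patterns carry different desired outputs."""
    return [proto for proto, class_targets in classes.items() if len(set(class_targets)) > 1]


# XOR targets for the inputs 00, 01, 10, 11.
targets = [0, 1, 1, 0]

# Layer with the bias (always 1) and a master unit that happens to compute OR.
master_only = [(1, 0), (1, 1), (1, 1), (1, 1)]
bad = unsuitable_classes(group_by_prototype(master_only, targets))

# After adding one auxiliary unit (here computing AND) the layer becomes suitable.
with_auxiliary = [(1, 0, 0), (1, 1, 0), (1, 1, 0), (1, 1, 1)]
still_bad = unsuitable_classes(group_by_prototype(with_auxiliary, targets))
```

In this example the prototype `(1, 1)` mixes desired outputs 1 and 0, so it is the class on which the next auxiliary unit would be trained.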


9.3.2.2 Upstart Algorithm 


The upstart algorithm was proposed by Frean [10]. This method is also dedicated to feed-forward 
neural networks with one output and binary function activation of all neurons. Unlike the tiling 
algorithm, the upstart method does not build a network layer by layer from the input to output, but 
new units are introduced between the input and output layers, and their task is to correct the error of 
the output unit. 
Let the network have to learn some binary classification. The basic idea is as follows: A given unit
Z generates another one, which corrects its error. Two types of errors can be distinguished: the switch
on fault (1 instead of 0) and the switch off fault (0 instead of 1). Let us consider the switch on fault
case. The answer of Z can be corrected by a big negative weight of the synaptic connection from a new
unit X, which is active only in the case of the switch on fault of Z. If Z is in the state of the switch
off fault, then this state can be corrected by a big positive weight of the connection from a new unit
Y, which is active at the suitable time. In order to fit the weights of the input connections to the
units X and Y using Rosenblatt's algorithm, their desired outputs are specified based on the activity
of Z. The X and Y units are called daughters and Z is the parent. The scheme of the building process is
shown in Figure 9.3, and the desired outputs of the X and Y units are presented in Table 9.1.

TABLE 9.1 Desired Outputs of (a) X and (b) Y Depending on Current $o_Z$ and Desired $t_Z$ Outputs of Z

The network architecture generated by the upstart algorithm is not conventional. It possesses the
structure of a hierarchical tree, and each unit is connected with the network input nodes.
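
The role of the daughter units can be sketched in a few lines. The targets below are derived only from the prose above (X active exactly on a switch on fault, Y active exactly on a switch off fault); they are an assumption and are not copied from Table 9.1. The helper names and the weight magnitude `big` are hypothetical.

```python
from typing import Tuple


def daughter_targets(o_z: int, t_z: int) -> Tuple[int, int]:
    """Desired outputs (x, y) of the daughters for a current output o_z and desired output t_z."""
    x = 1 if (o_z == 1 and t_z == 0) else 0  # switch on fault: Z fires but should not
    y = 1 if (o_z == 0 and t_z == 1) else 0  # switch off fault: Z is silent but should fire
    return x, y


def corrected_output(z_net: float, x: int, y: int, big: float = 10.0) -> int:
    """Parent output after adding the daughters with a big negative (X) and positive (Y) weight."""
    return 1 if z_net - big * x + big * y > 0 else 0


# Check all four (current, desired) combinations; z_net is chosen so that
# the uncorrected parent output equals o_z.
results = []
for o_z in (0, 1):
    for t_z in (0, 1):
        z_net = 0.5 if o_z == 1 else -0.5
        x, y = daughter_targets(o_z, t_z)
        results.append((o_z, t_z, corrected_output(z_net, x, y)))
```

If the daughters reproduce these targets exactly, the corrected parent output equals the desired output in every case; in the actual algorithm the daughters are themselves trained perceptrons and may need daughters of their own.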
9.3.2.3 BP Algorithm That Varies the Number of Hidden Units


The BP algorithm seems to be the most popular procedure for MLP training. Unfortunately, this
algorithm is not free of problems. The learning process sometimes gets stuck in an unsatisfactory
local optimum of the weight space. Hirose and coworkers [21] found that increasing the dimension of the
weight space by adding hidden neurons allows the BP algorithm to escape from the local optimum trap.
The phase of MLP growth is finished when the result of the learning process is satisfactory. The
obtained network is usually too big, that is, described by too many free parameters. Thus, the next
cyclic phase is started, in which hidden units are removed one by one until the convergence of the
learning algorithm is lost.

The proposed algorithm effectively reduces the time of MLP design. But the generalization ability is 
limited. Usually, the obtained network fits the learning set of patterns too exactly. 


9.3.2.4 Dynamic Node Creation 


The dynamic node creation algorithm was proposed by Ash [2]. This procedure is dedicated to MLP 
design with one hidden layer and is similar to the algorithm described in the previous paragraph. The 
initial architecture contains a small number of hidden units (usually two). Next, neurons are added to the 
network until the desired realization is fulfilled with a given accuracy. After adding a given unit, the net- 
work is trained using the BP algorithm. The crucial novelty is the fact that the learning process, started 
after adding a new neuron, is initialized from the weights obtained in the previous learning process 
of the smaller network. Only new free parameters (connected with the new node) of the network are 
randomly chosen. The main success of the dynamic node creation algorithm is the construction of a
network with six hidden units that solves the problem of the parity of six bits. This solution is very
difficult to obtain when the network with six hidden neurons is trained by the BP algorithm initialized 
from a set of randomly chosen weights. 
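
The warm start that distinguishes dynamic node creation can be sketched as a pure weight-bookkeeping step. This is an illustrative assumption about the data layout, not the code from [2]: `w_hidden` holds one row per hidden unit (one column per input, including the bias), `w_out` holds one row per output unit (one column per hidden unit), and the function name `add_hidden_unit` and the initialization range `scale` are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def add_hidden_unit(w_hidden: Matrix, w_out: Matrix,
                    rng: random.Random, scale: float = 0.1) -> Tuple[Matrix, Matrix]:
    """Grow a one-hidden-layer network by one unit, keeping all trained weights.

    Only the parameters connected with the new node are drawn at random.
    """
    n_inputs = len(w_hidden[0])
    new_input_weights = [rng.uniform(-scale, scale) for _ in range(n_inputs)]
    grown_hidden = [row[:] for row in w_hidden] + [new_input_weights]
    # Every output unit receives one extra, randomly initialized connection.
    grown_out = [row[:] + [rng.uniform(-scale, scale)] for row in w_out]
    return grown_hidden, grown_out


rng = random.Random(0)
trained_hidden = [[0.5, -0.2, 0.1]]   # one hidden unit, two inputs plus bias
trained_out = [[1.5]]                 # one output unit
grown_hidden, grown_out = add_hidden_unit(trained_hidden, trained_out, rng)
```

Training would then resume from `grown_hidden` and `grown_out` instead of from a fresh random initialization.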

Setiono and Chi Kwong Hui [51] extended the dynamic node creation algorithm by implementing a
learning process based on the BFGS quasi-Newton optimization algorithm rather than on the BP
algorithm. This modification accelerates the convergence of the learning process. Because the BFGS
algorithm trains a network of a small size in the initial phase, the memory space needed for the
estimate of the Hessian matrix is not a significant problem.


9.3.2.5 Canonical Form Method 


Theorem 9.2.1 as well as the definition of $\delta$-linear independence of vectors presented below are
the basis of the canonical form method [55].


Definition 9.3.1 Let $E$ be an n-dimensional vector space and $\{a_i\}_{i=1}^{n}$ be vectors from $E$.
The set of vectors $\{a_i\}_{i=1}^{n}$ is called $\delta$-linear independent if the matrix
$A = [a_i]_{i=1}^{n}$ fulfills the following relationship:

$$\det(A^T A) > \delta, \tag{9.9}$$

where $\delta$ is a given positive constant.
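
A check of the condition $\det(A^T A) > \delta$ can be sketched without any numerical library, because $A^T A$ is the Gram matrix of the vectors $a_i$. The function names below are hypothetical, and the determinant is computed by plain Gaussian elimination with partial pivoting, which is adequate only for small matrices.

```python
from typing import List, Sequence


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap changes the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def gram_determinant(vectors: Sequence[Sequence[float]]) -> float:
    """det(A^T A), where the columns of A are the given vectors."""
    k = len(vectors)
    gram: List[List[float]] = [
        [sum(a * b for a, b in zip(vectors[i], vectors[j])) for j in range(k)]
        for i in range(k)
    ]
    return determinant(gram)


def is_delta_independent(vectors: Sequence[Sequence[float]], delta: float) -> bool:
    return gram_determinant(vectors) > delta


orthogonal = [[1.0, 0.0], [0.0, 1.0]]
nearly_parallel = [[1.0, 0.0], [1.0, 0.01]]
```

The second pair of vectors is linearly independent in the strict sense, yet its Gram determinant is only about $10^{-4}$, so it fails the test for, say, $\delta = 0.01$. This is the sense in which $\delta$ separates useful units from nearly redundant ones.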

Let $U$ be a matrix whose columns are created by the successive input learning patterns. Let
$Y_1 = R_1 U$ and $Y_2 = R_2 Y_1$ be matrices whose columns are the vectors of replies of the first and
second hidden layers, respectively, to the input patterns. Based on Theorem 9.2.1 and Definition 9.3.1,
Wang and coworkers [55] prove that the optimal numbers of units in the first, $p_1$, and second, $p_2$,
layers have to fulfill the following inequalities:

$$\det\left(Y_1^T Y_1\right)\Big|_{p_1} > \delta, \qquad \det\left(Y_1^T Y_1\right)\Big|_{p_1+1} < \delta, \tag{9.10}$$

$$\det\left(Y_2^T Y_2\right)\Big|_{p_2} > \delta, \qquad \det\left(Y_2^T Y_2\right)\Big|_{p_2+1} < \delta, \tag{9.11}$$

where $\det\left(Y_i^T Y_i\right)\big|_{p_i}$ means that the matrix $Y_i$ has $p_i$ rows.


The p, and p, searching process is initialized from a relatively small number of units in both hidden 
layers. The inequalities (9.10) and (9.11) (or some of their modifications [55]) are checked for each ana- 
lyzed MLP architecture after the learning process. If the inequalities are not met, then a new node is 
added to an appropriate hidden layer and the network is trained again. 


9.3.2.6 Cascade-Correlation Method 


The cascade-correlation method [9] is similar to the upstart method, but it is applied to networks
whose nodes have continuous activation functions. The algorithm iteratively reduces the error generated
by the output units. In order to obtain this effect, hidden nodes that correlate or anti-correlate with
the quality measure based on the output response are introduced into the network. The
cascade-correlation algorithm is initialized with a structure that contains only output processing
units. Free parameters of these units are trained using a simple gradient-descent method. If the
response of the network is satisfactory, then the algorithm is finished. Otherwise, a candidate node to
be a hidden unit is introduced. This candidate unit receives signals from all network inputs and
previously introduced hidden units. The output of the candidate unit is not connected with the network
at this stage. Parameter tuning of the candidate is based on the maximization of the following quality
measure:


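$$J_c = \sum_{i=1}^{m} \left| \sum_{(u,y)\in\Phi_L} \left( y_c(u) - \bar{y}_c \right) \left( E_i(u) - \bar{E}_i \right) \right|, \tag{9.12}$$

where
m is the number of output units
$\Phi_L$ is a set of learning patterns
$y_c(u)$ is the response of the candidate to inputs u
$E_i(u)$ is the output error of the ith output unit generated by inputs u

$$\bar{y}_c = \frac{1}{\mathrm{card}(\Phi_L)} \sum_{(u,y)\in\Phi_L} y_c(u), \qquad \bar{E}_i = \frac{1}{\mathrm{card}(\Phi_L)} \sum_{(u,y)\in\Phi_L} E_i(u).$$
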

For $J_c$ maximization, the gradient-ascent method is applied. When the learning process of the
candidate is finished, it is included in the network, its parameters are frozen, and all free
parameters of the output units are trained again. This cycle is repeated until the output error is
acceptable. A sample of the cascade-correlation network is presented in Figure 9.4. The black dots
represent connection weights between units. It is important to notice that the obtained network is not
optimal in the sense of the number of network free parameters but in the sense of modeling quality.

FIGURE 9.4 Sample of the cascade-correlation network with two inputs and two outputs.
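
The candidate quality measure (9.12) is straightforward to evaluate once the candidate responses and the residual output errors over the learning set are available. The following sketch assumes they are given as plain lists; the function name `candidate_score` is hypothetical, and the gradient-ascent tuning of the candidate itself is not shown.

```python
from typing import Sequence


def candidate_score(candidate_outputs: Sequence[float],
                    output_errors: Sequence[Sequence[float]]) -> float:
    """Sum over output units of |covariance-like sum| between candidate response and error.

    candidate_outputs[p] is y_c(u) for the pth learning pattern;
    output_errors[i][p] is E_i(u) for output unit i and pattern p.
    """
    n_patterns = len(candidate_outputs)
    y_mean = sum(candidate_outputs) / n_patterns
    score = 0.0
    for errors_i in output_errors:
        e_mean = sum(errors_i) / n_patterns
        covariance_sum = sum((y_c - y_mean) * (e - e_mean)
                             for y_c, e in zip(candidate_outputs, errors_i))
        score += abs(covariance_sum)  # correlation and anti-correlation both count
    return score


errors = [[0.0, 1.0, 0.0, 1.0],   # output unit 1: error follows the candidate
          [1.0, 0.0, 1.0, 0.0]]   # output unit 2: error opposes the candidate
tracking_candidate = [0.0, 1.0, 0.0, 1.0]
constant_candidate = [1.0, 1.0, 1.0, 1.0]
```

A candidate whose response follows (or mirrors) the residual error scores high and is worth adding, while a constant candidate scores zero because it carries no information about the error.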


9.3.3 Top-Down (Pruning) Methods


An ANN architecture that can be trained to an acceptably small error has to be created initially.
Such an architecture is characterized by some redundancy. Top-down methods try to reduce the number of
free parameters of the network while preserving the learning error at an acceptable level. Three
classes of top-down methods can be distinguished:


• Penalty function methods
• Sensitivity methods
• Methods based on information analysis


In the first class, a penalty function, which penalizes architectures that are too big, is added to
the network quality criterion. In the second class, synaptic connections whose weights have a
negligibly small influence on the quality measure $J_T$ (9.6) are eliminated. The third class can be
treated as an expanded version of the second one. The decision regarding the pruning of a given node is
made after an analysis of the covariance matrix (or its estimate) of the hidden unit outputs. The number
of significantly large eigenvalues of this matrix is the necessary number of hidden units.


9.3.3.1 Penalty Function Methods 


The idea behind penalty function methods is the modification of the learning criterion, $J_L$, by
adding a component $\Gamma(v)$, which penalizes redundant architecture elements:

$$J'_L(y_{\mathrm{NA},v}, y) = J_L(y_{\mathrm{NA},v}, y) + \gamma\,\Gamma(v), \tag{9.13}$$

where $\gamma$ is a penalty coefficient. Usually, the correction of network free parameters is conducted
in two stages. First, new values $v'$ are calculated using standard learning methods (e.g., the BP
algorithm); next, these values are corrected as follows:

$$v = v'\left(1 - \eta\gamma\,\Gamma_K(v')\right), \tag{9.14}$$

where $\eta$ is a learning factor.
There are two approaches to the design of the penalty function $\Gamma(v)$ in MLP networks:


• Penalty for redundant synaptic connections
• Penalty for redundant hidden units

In the first case [20], the penalty function can be defined in different forms:

$$\Gamma(w) = \|w\|^2, \qquad \Gamma_{K,ij} = 1, \tag{9.15}$$

$$\Gamma(w) = \frac{1}{2}\sum_{i,j} \frac{w_{ij}^2}{1 + w_{ij}^2}, \qquad \Gamma_{K,ij} = \frac{1}{\left(1 + w_{ij}'^2\right)^2}, \tag{9.16}$$

$$\Gamma(w) = \frac{1}{2}\sum_{i,j} \frac{w_{ij}^2}{1 + \sum_k w_{ik}^2}, \qquad \Gamma_{K,ij} = \frac{1 + 2\sum_{k \ne j} w_{ik}'^2}{\left(1 + \sum_k w_{ik}'^2\right)^2}, \tag{9.17}$$

where $\Gamma_{K,ij}$ is the function that corrects the weight of the connection (j, i) in (9.14). The
first of the functions $\Gamma(w)$, (9.15), is a penalty for too high values of weights. A disadvantage
of this method is that it corrects all weights to the same extent, even when the solution of the
problem needs weights of high values. In the case (9.16), this problem is avoided. It is easy to see
that the expression (9.16) is similar to (9.15) for low values of weights and is negligibly small for
high values. The expression (9.17) preserves the properties of (9.16), and, additionally, it eliminates
units whose weight vector has a norm near zero.
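
The difference between the first two penalties is easiest to see by applying the correction step (9.14) to a small and a large weight. The sketch below assumes the correction factors $\Gamma_K = 1$ for (9.15) and $\Gamma_K = 1/(1 + w'^2)^2$ for (9.16), the latter being the derivative of $w^2 / (2(1 + w^2))$ divided by $w$; the function names and the numerical values of $\eta$ and $\gamma$ are hypothetical.

```python
from typing import List, Sequence


def decay_uniform(weights: Sequence[float], eta: float, gamma: float) -> List[float]:
    """Correction (9.14) with Gamma_K = 1: every weight shrinks by the same factor."""
    return [w * (1.0 - eta * gamma) for w in weights]


def decay_selective(weights: Sequence[float], eta: float, gamma: float) -> List[float]:
    """Correction (9.14) with Gamma_K = 1 / (1 + w^2)^2: large weights are almost untouched."""
    return [w * (1.0 - eta * gamma / (1.0 + w * w) ** 2) for w in weights]


after_bp = [0.1, 10.0]          # weights v' obtained from a standard learning step
uniform = decay_uniform(after_bp, eta=0.5, gamma=0.2)
selective = decay_selective(after_bp, eta=0.5, gamma=0.2)
```

Both rules shrink the small weight by roughly ten percent, but only the uniform rule also shrinks the large weight by ten percent. The selective rule leaves it essentially unchanged, which matches the remark that (9.16) avoids penalizing weights that the problem genuinely needs to be large.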

Units whose activation changes to a small extent during the learning process can be considered as
redundant [5]. Let $\Delta_{i,p}$ be the activation change of the ith hidden unit after the presentation
of the pth learning pattern; thus the penalty function can be chosen in the following form:

$$\Gamma(w) = \sum_i \sum_p e\left(\Delta_{i,p}^2\right), \tag{9.18}$$

where the internal summation is conducted over all learning patterns and the external summation over
all hidden units. There are many possible definitions of $e(\Delta_{i,p}^2)$. One looks for such a
function $e(\Delta_{i,p}^2)$ that a small activation change forces big parameter corrections, and vice
versa. Because the weight corrections corresponding to the penalty function are proportional to its
partial derivatives over the individual $\Delta_{i,p}^2$, the above property is met when

$$\frac{\partial e\left(\Delta_{i,p}^2\right)}{\partial \Delta_{i,p}^2} = \frac{1}{\left(1 + \Delta_{i,p}^2\right)^n}. \tag{9.19}$$

The exponent n controls the penalty process. The higher it is, the better the elimination of a hidden
unit with low activation.


9.3.3.2 Sensitivity Methods

The sensitivity of the synaptic connection (j, i) is defined as the influence of its elimination on
the value of $J_T$. There are several different sensitivity measures in the literature. Mozer and
Smolensky [35] introduce into the model of the ith neuron an additional set of coefficients
$\{\alpha_{ij}\}$:

$$y_i = f\left(\sum_j w_{ij}\,\alpha_{ij}\,u_j\right), \tag{9.20}$$

where
the summation over j is done over all input connections of the ith unit
$w_{ij}$ is the weight of the jth input of the unit considered
$y_i$ is its output signal
$u_j$ is the jth input signal
$f(\cdot)$ represents the activation function


If $\alpha_{ij} = 0$, then the connection is removed, while for $\alpha_{ij} = 1$ this connection exists
in a normal sense. The sensitivity measure is defined as follows:

$$s_{ij}^{MS} = -\left.\frac{\partial J_T}{\partial \alpha_{ij}}\right|_{\alpha_{ij}=1}. \tag{9.21}$$

Karnin [23] proposes a simpler definition. Sensitivity is the difference between the quality measure
of the full network, $J_T$, and that obtained after removing the connection (j, i), $J_T^{ij}$:

$$s_{ij}^{K} = -\left(J_T - J_T^{ij}\right). \tag{9.22}$$


This idea was analyzed and developed in [45]. 
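
The definition of sensitivity as the change of the quality measure caused by removing one connection can be illustrated by brute force on a single linear unit. This sketch evaluates the cost with and without each weight directly; it does not reproduce the estimation scheme of [23], and the function names and the sum-squared-error cost are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[Sequence[float], float]  # (input vector, desired output)


def sum_squared_error(weights: Sequence[float], patterns: Sequence[Pattern]) -> float:
    """Cost of a single linear unit y = sum_j w_j * u_j over a pattern set."""
    return sum((sum(w * u for w, u in zip(weights, inputs)) - target) ** 2
               for inputs, target in patterns)


def removal_sensitivities(weights: Sequence[float], patterns: Sequence[Pattern]) -> List[float]:
    """Cost after removing connection j minus the cost of the full unit, for every j."""
    full_cost = sum_squared_error(weights, patterns)
    sensitivities = []
    for j in range(len(weights)):
        pruned = list(weights)
        pruned[j] = 0.0  # removing a connection is equivalent to zeroing its weight
        sensitivities.append(sum_squared_error(pruned, patterns) - full_cost)
    return sensitivities


# The desired output depends on the first input only.
patterns = [([1.0, 1.0], 2.0), ([2.0, -1.0], 4.0), ([0.0, 3.0], 0.0)]
sensitivities = removal_sensitivities([2.0, 0.0], patterns)
prune_index = min(range(len(sensitivities)), key=sensitivities.__getitem__)
```

The connection with the smallest sensitivity, here the second one, is the pruning candidate. The brute-force evaluation needs one pass over the test set per connection, which is why cheaper estimates are of practical interest.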

One of the best-known sensitivity methods is the optimal brain damage (OBD) algorithm [29]. If all
weights of the network are described by one vector $w$, then the Taylor series of the network quality
measure around the current solution $w^*$ has the following form:

$$J_T(w) - J_T(w^*) = \nabla^T J_T(w^*)(w - w^*) + \frac{1}{2}(w - w^*)^T H(w^*)(w - w^*) + O\left(\|w - w^*\|^3\right), \tag{9.23}$$

where $H(w^*)$ is the Hessian matrix. Because the weight reduction follows the learning process, it can
be assumed that $w^*$ is an argument of a local minimum of $J_T$ and $\nabla J_T(w^*) = 0$. If it is
assumed that the value of $O(\|w - w^*\|^3)$ is negligibly small, then

$$J_T(w) - J_T(w^*) \approx \frac{1}{2}(w - w^*)^T H(w^*)(w - w^*). \tag{9.24}$$


The Hessian matrix is usually of great dimension, because even an ANN of a medium size has hundreds
of free parameters, and its calculation is time consuming. In order to simplify this problem, it is
assumed [29] that the diagonal elements of $H(w^*)$ are predominant. Thus

$$s_{ij}^{OBD} = \frac{1}{2}\,\frac{\partial^2 J_T}{\partial w_{ij}^2}\, w_{ij}^2. \tag{9.25}$$

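The calculation of the diagonal elements of the Hessian matrix in (9.25) is based on the application
of the back-propagation technique to the second derivatives. The activation $u_{L,i}$ of the ith unit of
the layer $V_L$ has the form

$$u_{L,i} = f\left(\tilde{u}_{L,i}\right), \qquad \tilde{u}_{L,i} = \sum_{j=0}^{n_{L-1}} w_{L,ij}\, u_{L-1,j}, \tag{9.26}$$

where $n_{L-1} = \mathrm{card}(V_{L-1})$ is the number of units in the previous layer and
$\tilde{u}_{L,i}$ denotes the weighted input sum of the unit. In this notation, $u_{0,i}$ describes the
ith input signal of the whole network, and $u_{M,i} = y_{\mathrm{NA},i}$ is the ith output signal of the
network. It is easy to see that (see (9.26))

$$\frac{\partial^2 J_T}{\partial w_{L,ij}^2} = \frac{\partial^2 J_T}{\partial \tilde{u}_{L,i}^2}\, u_{L-1,j}^2. \tag{9.27}$$

Finally, the second derivatives of the quality criterion over the input signals to individual neurons
have the following form:


• For the output layer $V_M$,

$$\frac{\partial^2 J_T}{\partial \tilde{u}_{M,i}^2} = \left(f'\right)^2 \frac{\partial^2 J_T}{\partial y_{\mathrm{NA},i}^2} + f''\, \frac{\partial J_T}{\partial y_{\mathrm{NA},i}}. \tag{9.28}$$

• For other layers ($V_L \mid L = 1 \ldots M-1$),

$$\frac{\partial^2 J_T}{\partial \tilde{u}_{L,i}^2} = \left(f'\right)^2 \sum_{k=1}^{n_{L+1}} w_{L+1,ki}^2\, \frac{\partial^2 J_T}{\partial \tilde{u}_{L+1,k}^2} + f''\, \frac{\partial J_T}{\partial u_{L,i}}. \tag{9.29}$$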

The improved version of the OBD method is the optimal brain surgeon (OBS) algorithm [18]. In this
method, the elimination of the ith weight of the vector $w^*$ (9.24) is treated as a step of the
learning process, in which the newly obtained weight vector $w$ differs from $w^*$ by only one element,
the removed weight $w_i$. Thus, the following relation is fulfilled:

$$e_i^T (w - w^*) + w_i = 0, \tag{9.30}$$

where $e_i$ is a unit vector with one on the ith location. The problem is to find the weight that
meets the following condition:

$$w_i = \arg\min_{w} \left( \frac{1}{2}(w - w^*)^T H(w^*)(w - w^*) \right) \tag{9.31}$$

and the relation (9.30). It is proved in [18] that the weight $w_i$ that meets the condition (9.31) is
also an argument of the minimum of the following criterion:

$$s_i^{OBS} = \frac{1}{2}\,\frac{w_i^2}{\left[H^{-1}\right]_{ii}}. \tag{9.32}$$

The connection (j, i) for which the value of $s_{ij}$ ((9.21), (9.22), (9.25), or (9.32)) is the
smallest is pruned. Except in the OBS method, the network is trained again after pruning. In the OBS
method, the weight vector is corrected (see (9.31) and (9.30)):

$$\Delta w = -\frac{w_i}{\left[H^{-1}\right]_{ii}}\, H^{-1} e_i. \tag{9.33}$$
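
One OBS pruning step can be sketched for a small, explicitly given Hessian. The code assumes the saliency $\frac{1}{2} w_i^2 / [H^{-1}]_{ii}$ and the correction $\Delta w = -(w_i / [H^{-1}]_{ii}) H^{-1} e_i$, with $w_i$ read as the current value of the weight to be removed. The function names are hypothetical, and the Gauss-Jordan inversion is only suitable for tiny matrices; practical implementations build $H^{-1}$ recursively.

```python
from typing import List, Sequence, Tuple


def invert(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Hessian is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def obs_prune_step(weights: Sequence[float],
                   hessian: Sequence[Sequence[float]]) -> Tuple[int, List[float], List[float]]:
    """Return (pruned index, saliencies, corrected weights) for one OBS step."""
    h_inv = invert(hessian)
    saliencies = [0.5 * w * w / h_inv[i][i] for i, w in enumerate(weights)]
    q = min(range(len(weights)), key=saliencies.__getitem__)
    scale = -weights[q] / h_inv[q][q]
    # H^{-1} e_q is the qth column of the inverse Hessian.
    corrected = [w + scale * h_inv[r][q] for r, w in enumerate(weights)]
    corrected[q] = 0.0  # exact zero instead of a rounding residue
    return q, saliencies, corrected


pruned_index, saliencies, corrected = obs_prune_step([1.0, 0.5], [[2.0, 1.0], [1.0, 2.0]])
```

In this example the second weight is removed and the first one moves from 1.0 to 1.25 to compensate, so no retraining is needed. The cost increase predicted by (9.24) for this step equals the saliency of the removed weight, 0.1875.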
Another solution was proposed in [4], where units are also pruned without subsequent retraining. The
method is based on the simple idea of iteratively removing hidden units and then adjusting the remaining
weights while maintaining the original input-output behavior. If a chosen hidden unit (the kth) is
pruned, then all its input and output connections are also removed. Let the ith unit be a successor of the
kth unit. We want to keep the value of the ith unit output. To achieve this goal, the weights of the other
input synaptic connections have to be corrected so that the following relation holds:


$\sum_{j \in K_i} w_{ij} u_j^{\mu} = \sum_{j \in K_i \setminus \{k\}} (w_{ij} + \delta_{ij}) u_j^{\mu}$ (9.34)


for all learning patterns $\mu$. $K_i$ denotes the set of indices of units preceding the ith unit (before
pruning), and $\delta_{ij}$ is the correction of the weight $w_{ij}$. Equation 9.34 can be reduced to a set of
linear equations:


$\sum_{j \in K_i \setminus \{k\}} \delta_{ij} u_j^{\mu} = w_{ik} u_k^{\mu}$. (9.35)


Castellano and coworkers [4] solved this set of equations in the least-squares sense using an efficient
preconditioned conjugate gradient procedure.


9.3.3.2.1 Methods Based on Information Analysis 


Let us consider an ANN of the regression type: one hidden layer of units with a sigmoid activation func- 
tion and one output linear unit. Generalization to a network with many outputs is simple. Let us assume 
that the network has learned some input-output relation with a given accuracy. Let $u^h_{p,i}$ be an output
signal of the ith hidden unit for the pth learning pattern. The covariance matrix C connected with the
outputs of the hidden layer, and calculated over the whole learning set, has the form


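$C = \frac{1}{\operatorname{card}(\Phi_L)} \sum_{p=1}^{\operatorname{card}(\Phi_L)} (u_p^h - \bar{u}^h)(u_p^h - \bar{u}^h)^T$, (9.36)


where $u_p^h$ is the vector of hidden-unit signals for the pth learning pattern and
$\bar{u}^h = \frac{1}{\operatorname{card}(\Phi_L)} \sum_{p=1}^{\operatorname{card}(\Phi_L)} u_p^h$. The covariance matrix is symmetric and positive
semidefinite; thus it can be transformed to a diagonal form using some orthonormal matrix U:


$C = U \operatorname{diag}(\lambda_i \mid i = 1, \ldots, n)\, U^T$, (9.37)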


where n is the number of hidden units. 

Topological optimization at the neural level can be done using the analysis of the eigenvalues of C
($\lambda_i \mid i = 1, \ldots, n$), where one assumes that the number of pruned units is equal to the number of
negligibly small eigenvalues [57]. However, the network has to be retrained after unit pruning, and there
is no direct relation between the pruned units and the negligible eigenvalues of C.

Alippi and coworkers [1] propose a different optimization procedure, during which the network
generalization ability increases without retraining. This goal is achieved by introducing a virtual layer. This
layer is located between the hidden layer and the output unit, and it possesses the same number of units
n as the hidden layer (Figure 9.5b). Weights of connections between the hidden and virtual layers are chosen
in the form of the matrix U, and between the virtual layer and the output unit they are equal to $U^T w$, where
w is the weight vector, obtained during the training process, of connections between the hidden layer
and the output unit. In this way the network output does not change: $y = w^T U (U^T u^h) = w^T u^h$. It is
easy to see that the covariance matrix $C_v$ corresponding to the outputs of the virtual units is diagonal (9.37):
$C_v = \operatorname{diag}(\lambda_i \mid i = 1, \ldots, n)$. It means that the outputs of the virtual units are uncorrelated. If the
variance $\lambda_i$ is negligibly small, then one can assume that $\lambda_i = 0$; thus, the ith virtual unit, independently
of the input signal, has a constant output. This value can be added to the bias of the output unit and the virtual
unit considered can be pruned. Such a process is repeated as long as the generalization ability of the
network increases, that is, as long as $J_T$ decreases. The above method finds the optimal network architecture in
the sense of the generalization ability. In the sense of the minimization of the number of network free
parameters, such a network is still redundant.

The OBD and OBS methods use the second order Taylor expansion of the error function to esti- 
mate its changes when the weights are perturbed. It is assumed that the first derivative is equal to zero 
(pruning after learning—locally optimal weights) and the error function around the optimum can be 
treated as a quadratic function. The OBD method assumes additionally that the off-diagonal terms of 
the Hessian matrix are zero. 

Engelbrecht [8] showed that objective function sensitivity analysis, which is the main idea of the
OBD, can be replaced with output sensitivity analysis. His pruning algorithm based on output sensitivity
analysis involves a first-order Taylor expansion of the ANN output. The basic idea is that a parameter
with low average sensitivity and with negligibly low sensitivity variance taken over all learning patterns
has a negligible effect on the ANN output. Lauret et al. [28] also propose an output sensitivity analysis,
but theirs is based on the Fourier amplitude sensitivity test (FAST) method described in [49].

FIGURE 9.5 (a) Regression type network and (b) its version with the virtual layer.

The ANN considered in [11,46] is also of the regression type (Figure 9.5a). The method proposed in [46]
consists of two phases. In the first phase (the so-called additive phase), the procedure starts with the
smallest architecture, with one hidden unit, and subsequent units are added iteratively until the largest
possible network for the problem considered is found. For each structure, the condition number $\kappa(Z)$
of the Jacobian matrix


$Z = \frac{\partial f(x, v)}{\partial v}$ (9.38)


of the approximated function f(x, v) with respect to the network parameters v is calculated using singular value
decomposition (SVD). If the condition number $\kappa(Z) > 10^8$, the procedure is stopped. Moreover, the
multi-start Levenberg-Marquardt method is used in order to estimate network free parameters, and
the residual signal and the estimated generalization error are stored. This phase is similar to
that presented by Fukumizu [11], where the procedure is theoretically based on the theory of statistical
learning and the theory of measurement optimization. If the condition number in [46] exceeds $10^8$, then
the inverse $(Z^T Z)^{-1}$, which is an estimate of the Fisher matrix, is less than $10^{-16}$ (below the
floating-point precision of standard computers). When the first phase is stopped and the second is started [46],
Fukumizu [11] carries out integrate tests for each hidden node and prunes redundant units in accordance
with the proposed sufficient conditions of Fisher matrix singularity. The aim of the second phase [46] is
to remove first the redundant units and then the redundant connections between the input and hidden layers. The
statistical hypothesis for each architecture obtained in the first phase is tested. This hypothesis checks
whether the family of functions represented by the neural model contains the approximated function.
To do so, the estimator of the Fisher distribution based on (9.8) is calculated. If Gaussian noise
is assumed and the null hypothesis cannot be rejected, then there is evidence that the residual signal
contains only random disturbances. The ANN architecture that passes the above test is further simplified
by pruning redundant connections between the input and hidden layers. This task is also solved using
statistical hypothesis testing described in detail in [46].


9.3.4 Discrete Optimization Methods 


The space of ANN architectures is infinite and discrete. The main problem is to choose a representation of
each architecture and to order the architectures in a structure convenient for searching. The most popular
method is encoding the network architecture as a sequence of symbols from a finite alphabet. Graph,
tree, and matrix representations have also been used by researchers in order to find the best possible ANN
architecture.


9.3.4.1 Evolutionary Algorithms 


The application of evolutionary algorithms to the ANN design process has a history of about 20 years.
As global optimization algorithms, they can be used in neural networks for three tasks:


• Selection of parameters of the ANN with a fixed structure (learning process)

• Searching for the optimal ANN architecture, with the learning process conducted using other
methods

• Application of evolutionary algorithms to both of the above tasks simultaneously


The last two tasks are the subject of this chapter. The most popular class of evolutionary algorithms is
the GA, which seems to be the most natural tool for a discrete space of ANN architectures. This
results from the classical chromosome structure: a string of bits. Such a representation is used in many
applications (cf. [3,17,19,50]). Initially, an ANN architecture, $NA_{max}$, sufficient to represent an input-output
relation is selected. This architecture defines the upper limit of ANN architecture complexity. Next,
all units from the input, hidden, and output layers of $NA_{max}$ are numbered from 1 to N. In this way,
the search space of architectures is reduced to a class of digraphs of N nodes. An architecture NA
(a digraph) is represented by an incidence matrix V of $N^2$ elements. Each element is equal to 0 or 1. If
$V_{ij} = 1$, then the synaptic connection between the ith and the jth node exists. The chromosome is
created by rewriting the matrix V row by row to one binary string of length $N^2$. If the initial population
of chromosomes is randomly generated, then the standard GA can be applied to search for the optimal
ANN architecture.
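The direct encoding described above can be illustrated with a short sketch. It assumes that the entry in row i and column j marks a connection from node i to node j; the function names `encode` and `decode` and the small example network are hypothetical.

```python
def encode(v):
    """Rewrite an N x N incidence matrix row by row into one bit string."""
    return "".join(str(bit) for row in v for bit in row)


def decode(chromosome, n):
    """Rebuild the N x N incidence matrix from a bit string of length N^2."""
    if len(chromosome) != n * n:
        raise ValueError("chromosome length must equal N^2")
    return [[int(chromosome[r * n + c]) for c in range(n)] for r in range(n)]


# Hypothetical network with N = 4 nodes: nodes 0 and 1 are inputs,
# node 2 is a hidden unit, and node 3 is the output unit.
V = [[0, 0, 1, 0],   # input 0 -> hidden 2
     [0, 0, 1, 0],   # input 1 -> hidden 2
     [0, 0, 0, 1],   # hidden 2 -> output 3
     [0, 0, 0, 0]]
chromosome = encode(V)
```

Even this tiny network needs 16 bits, of which only 3 are set, which illustrates why long, sparse chromosomes become a practical problem for larger architectures.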

It is easy to see that the chromosome created by the above procedure can represent any ANN architecture,
including those with backward connections. If one wants to restrict the search to MLP architectures, then the
matrix V contains many elements equal to 0, which cannot be changed during the search process. In
this case, the definition of genetic operators is complicated and a large amount of memory is unnecessarily
occupied. It is sensible to omit such elements from the chromosome [44].
In practice, the neural networks applied contain hundreds to thousands of synaptic connections.
Standard genetic operators working on such long chromosomes are not effective. Moreover, when the
complexity of the ANN architecture increases, the convergence of the evolutionary process slows down.
Thus, many researchers look for representations of ANN architectures that simplify the evolutionary
process. In [34], the architecture of the ANN is directly represented by the incidence matrix V. The
crossover operator is defined as a random exchange of rows or columns between two matrices of the
population. In the mutation operator, each bit of the matrix is inverted with some (very low) probability.

The above method of genotypic representation of the ANN architecture is called direct encoding [25].
It means that each bit of the chromosome can be interpreted directly as indicating whether or
not a particular synaptic connection exists. The disadvantage of these methods is slow convergence
when a complex ANN architecture is sought, or a complete loss of convergence in the limit.
Moreover, if the initial architecture is very large, then the search process does not find an optimal
solution but only achieves some level of network reduction. In these cases, the quality
measure of the search method can be defined by the so-called compression factor [44], defined as


$\kappa = \frac{\eta^*}{\eta_{max}} \times 100\%$, (9.39)


where
$\eta^*$ is the number of synaptic connections of the architecture obtained by the evolutionary process
$\eta_{max}$ is the maximal number of synaptic connections permissible for the selected representation of
the ANN architecture


Methods of indirect encoding were proposed in papers [25,27,32]. In [32], an individual of the popu- 
lation contains the binary code of network architecture parameters (the number of hidden layers, the 
number of units in each hidden layer, etc.) and parameters of the back-propagation learning process (the 
learning factor, the momentum factor, the desired accuracy, the maximum number of iterations, etc.). 
Possible values of each parameter belong to a discrete, finite set whose cardinality is determined by
the fixed number of bits used to represent this parameter. In this way, the evolutionary process searches
for the optimal ANN architecture and the optimal learning process simultaneously. 

Another proposal [25] is encoding based on the graph creation system. Let the search space be
limited to ANN architectures of $2^{h+1}$ units at the very most. Then the incidence matrix can be
represented by a tree of height h, where each element either possesses four descendants or is a leaf. Each
leaf is one of 16 possible binary matrices of size 2 × 2. The new type of individual representation needs
new definitions of crossover and mutation operators, both of which are explained in Figure 9.6.


FIGURE 9.6 Genetic operators for tree representations of chromosomes: crossover and mutation.
Koza and Rice [27] propose quite a different approach to the MLP creation process based on genetic
programming (GP). The single output of the network can be described as follows:


$y = f\left( \sum_i w_i^M u_i^{M-1} \right)$. (9.40)

The output of each network processing unit can be expressed by the formula

$u_i^m = f\left( \sum_j w_{ij}^m u_j^{m-1} \right)$, $m = 1, \ldots, M-1$; $i = 1, 2, \ldots$ (9.41)


The expressions (9.40) and (9.41) can be represented by a hierarchical tree of operators (nodes):

$F = \{f, W, +, -, *, \%\}$ (9.42)

and terms (leaves):

$T = \{u_1, \ldots, u_n, R\}$, (9.43)


where
f is a nonlinear activation function
W is a weight function (the product of the signal and weight)


Both functions can have various numbers of arguments. Other elements of F are the ordinary operators
of addition (+), subtraction (-), and multiplication (*); the operator (%) is ordinary division except
for division by 0, in which case the output is equal to 0. The set of terms (9.43) contains the network
input signals and an atomic floating-point constant R. The MLP is represented by a tree with nodes
selected from the set F (9.42) and leaves selected from the set T (9.43). The population of trees is subjected
to the evolutionary process, which uses the genetic operators presented in Figure 9.6.
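To make the tree representation concrete, the following sketch evaluates such a tree for given input signals. The nested-tuple layout, the choice of a sigmoid for f, and the rule that f acts on the sum of its arguments are assumptions made for this illustration only; they are not taken from [27]. `math.prod` requires Python 3.8 or later.

```python
import math


def protected_div(a, b):
    """The % operator of (9.42): ordinary division, except that x % 0 gives 0."""
    return 0.0 if b == 0 else a / b


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "%": protected_div,
}


def evaluate(node, inputs):
    """Evaluate a tree given as nested tuples (operator, argument, ...)."""
    if isinstance(node, str):            # leaf: an input signal such as "u1"
        return inputs[node]
    if isinstance(node, (int, float)):   # leaf: a floating-point constant R
        return float(node)
    op, *args = node
    values = [evaluate(arg, inputs) for arg in args]
    if op == "f":                        # assumed: activation of the summed arguments
        return sigmoid(sum(values))
    if op == "W":                        # weight function: product of its arguments
        return math.prod(values)
    return BINARY[op](*values)


# Hypothetical tree: f( W(u1, 1.0), W(u2, 2.0 % 0.0) ); the second weight is 0
# because of the protected division.
tree = ("f", ("W", "u1", 1.0), ("W", "u2", ("%", 2.0, 0.0)))
output = evaluate(tree, {"u1": 0.0, "u2": 3.0})
```

Because every operator returns a number for any numeric arguments, subtrees can be exchanged freely by crossover without producing an invalid expression, which is the practical reason for the protected division.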

A* algorithm. The A* algorithm [37] is a way to implement best-first search on a problem graph.
The algorithm operates by searching a directed graph in which each node $n_i$ represents a point in the
problem space. Each node contains, in addition to a description of the problem state it represents, an
indication of how promising it is, a parent link that points back to the best node from which it came, and
a list of the nodes that were generated from it. The parent link makes it possible to recover the path
to the goal once the goal is found. The list of successors makes it possible, if a better path is found to
an already existing node, to propagate the improvement down to its successors.

A heuristic function $f(n_i)$ is needed that estimates the merits of each generated node. In the A*
algorithm, this cost function is defined as a sum of two components:


$f(n_i) = g(n_i) + h(n_i)$, (9.44)


where $g(n_i)$ is the cost of the best path from the start node $n_0$ to the node $n_i$, known exactly as
the sum of the costs of the rules that were applied along the best path from $n_0$ to $n_i$, and $h(n_i)$ is
the estimate of the additional cost of getting from the node $n_i$ to the nearest goal node. The function $h(n_i)$
contains the knowledge about the problem.
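A compact sketch of the search loop may help here. It is a generic A* on a small hypothetical graph with scalar costs, not the architecture search of [7]. Instead of propagating an improved cost through stored successor lists, this variant re-inserts an improved node into the priority queue and skips outdated queue entries, which is a common simplification.

```python
import heapq


def a_star(start, is_goal, successors, h):
    """Best-first search ordered by f(n) = g(n) + h(n).

    successors(n) returns (next_node, step_cost) pairs and h(n) estimates
    the remaining cost. Returns (path, cost), or (None, inf) on failure.
    """
    g = {start: 0.0}              # best known cost from the start node
    parent = {start: None}        # parent links used to recover the path
    open_heap = [(h(start), start)]
    while open_heap:
        f, node = heapq.heappop(open_heap)
        if f > g[node] + h(node):
            continue              # outdated entry: a better path was found later
        if is_goal(node):
            cost, path = g[node], []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], cost
        for nxt, step_cost in successors(node):
            new_g = g[node] + step_cost
            if nxt not in g or new_g < g[nxt]:
                g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + h(nxt), nxt))
    return None, float("inf")


# Hypothetical problem graph and heuristic estimates (never overestimating).
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 1)], "D": []}
estimate = {"A": 2, "B": 2, "C": 1, "D": 0}
path, cost = a_star("A", lambda n: n == "D",
                    lambda n: graph[n], lambda n: estimate[n])
```

In the example the direct edge from A to C is recorded first and then replaced when the cheaper route through B is discovered, which is the situation the parent links are meant to handle.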
In order to apply the A* algorithm to the MLP architecture optimization process [7], the following
components have to be defined:


• The set, G, of goal architectures

• The expansion operator $\Xi: A \to 2^A$, which determines the set of network architectures that are
successors of the architecture $NA \in A$

• The cost function g(NA, NA') connected with each expansion operation

• The heuristic function h(NA)


The goal MLP architecture is obtained if the learning process is finished with a given accuracy:


$G = \{ NA \in A \mid \max J_T(y_{NA,v}, y) \le \eta_0 \}$, (9.45)


where $\eta_0$ is a chosen nonnegative constant.
The expansion operator $\Xi$ generates successors in a twofold way:


• By adding a hidden layer: the successor NA' of the NA is obtained by adding a new hidden layer
directly in front of the output layer with the number of hidden units equal to the number of
output units.

• By adding a hidden unit: it is assumed that the current architecture NA possesses at least one
hidden layer; the successor NA' is created by adding a new hidden unit to the selected hidden
layer.


In this way, the space of MLP architectures is ordered into the digraph presented in Figure 9.7.
It can be proved that, for each successor NA' of NA created in this way, there exist sets of free parameters
such that [7]


$\forall NA' \in \Xi(NA) \;\; \exists v': \; J_T(y_{NA',v'}, y) \le J_T(y_{NA,v}, y)$. (9.46)


FIGURE 9.7 Digraph of MLP architectures for a single-input, single-output problem (axes: increasing number of
units in the first layer; increasing number of layers). p-q-r describes the architecture with p units in the
first hidden layer, q units in the second one, and r units in the third one.
Each expansion $NA' \in \Xi(NA)$ is connected with an increase of the cost vector:


$g(NA, NA') = \begin{bmatrix} \vartheta_1(NA') - \vartheta_1(NA) \\ \vartheta_2(NA') - \vartheta_2(NA) \end{bmatrix}$, (9.47)


where
$\vartheta_1(NA) = \sum_{i=1}^{M-1} \operatorname{card}(V_i)$ is the number of hidden units
$\vartheta_2(NA) = M - 1$ is the number of hidden layers of the NA

The effectiveness of the A* algorithm strongly depends on the chosen definition of the heuristic function
h(NA). For the MLP architecture optimization problem, Doering and co-workers [7] propose it as follows:


$h(NA) = \frac{\alpha e_L(NA) + \beta e_T(NA)}{\alpha e_L(NA_0) + \beta e_T(NA_0)} \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, (9.48)

where

$e_{L(T)} = \frac{1}{\operatorname{card}(\Phi_{L(T)})} J_{L(T)}(y_{NA,v}, y)$


is the mean error obtained for the learning (testing) set; moreover, $\alpha + \beta = 1$ and $\alpha, \beta \ge 0$. A large
learning error significantly influences the heuristic function in the case of architectures near the initial
architecture $NA_0$. This error decreases during the search process, and the selection of a successor is then
dominated by the generalization component.

In order to compare different goal functions (9.44), the relation of the linear order $\prec$ of two vectors
$a, b \in R^2$ must be defined:


$(a \prec b) \Leftrightarrow (a_1 < b_1) \vee \left( (a_1 = b_1) \wedge (a_2 < b_2) \right)$. (9.49)
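Assuming that the relation (9.49) is the usual strict lexicographic order on pairs (the first components decide, and the second components are compared only on a tie), it can be checked with a few lines of code. The function name `precedes` is hypothetical; Python tuples happen to compare in the same way, which the sketch uses as a cross-check.

```python
def precedes(a, b):
    """Strict lexicographic order on two-element vectors (assumed reading of 9.49)."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] < b[1])


# Hypothetical cost vectors.
pairs = [((1, 5), (2, 0)), ((1, 2), (1, 3)), ((1, 3), (1, 3)), ((2, 0), (1, 9))]
results = [precedes(a, b) for a, b in pairs]
# Python's built-in tuple comparison follows the same rule.
same_as_builtin = all(precedes(a, b) == (a < b) for a, b in pairs)
```

With such an order, any two cost vectors are comparable, so the node with the smallest value of f can always be selected unambiguously.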


The A* algorithm is a very effective tool for MLP architecture optimization. The advantage of this algo- 
rithm over the cascade-correlation algorithm has been shown [7]. However, its computational complex- 
ity is very large in the case of complex goal architectures, because the number of successors increases 
very fast with the current architecture complexity. This problem is especially visible in the case of net- 
works of dynamic units [38]. 


9.3.4.2 Simulated Annealing and Tabu Search 


Simulated annealing (SA) [24] is based on the observation of the crystal annealing process, which serves to
reduce crystal defects. The system state is represented by a point S in the space of feasible solutions of a
given optimization problem. The neighboring state S' of the state S differs from S only in one parameter.
The minimized objective function E is called the energy, by physical analogy, and the control
parameter T is called the temperature. The SA algorithm consists of the following steps:


1. Choose the initial state $S = S_0$ and the initial temperature $T = T_0$.

2. If the stop condition is satisfied, then stop with the solution S, else go to 3.

3. If the equilibrium state is achieved, go to 8, else go to 4.

4. Randomly choose a new neighboring state S' of the state S.

5. Calculate ΔE = E(S') − E(S).

6. If ΔE < 0 or X < exp(−ΔE/T), where X is a uniformly distributed random number from the interval
[0,1), then S = S'.

7. Go to 3.

8. Update T and go to 2.
A positive temperature (T > 0) allows us to choose a state S', whose energy is higher than the
energy of the current state S, as the base state for the further search, and thus there is a chance to avoid
getting stuck in a local optimum. Dislocations, which deteriorate the system energy, are controlled by the
temperature T. Their range and frequency of occurrence decrease as T → 0. A state in which the energy
almost does not change (with a given accuracy, which is a function of temperature) in a given time interval
can be chosen as the equilibrium state. This criterion is relatively strict and often cannot be met.
So, usually, the number of iterations is fixed for a given temperature. The initial temperature is the
measure of the maximal "thermal" fluctuations in the system. Usually, it is assumed that the chance
of achieving any system energy should be high at the beginning of the search process. A linear
decrease of the temperature is not recommended. The linear annealing strategy causes an exponential
decrease of "thermal" fluctuations, and the search process usually gets stuck in a local optimum. Two
annealing strategies are recommended:


$T(t_n) = \begin{cases} \dfrac{T_0}{1 + \ln t_n}, \\ \alpha T(t_n - 1), \end{cases}$ (9.50)


where
$t_n$ is the number of temperature updates
$\alpha \in [0,1]$ is a given constant


The annealing strategy determines the stop condition. If the strategy (9.50) is used, the process is stopped
when the temperature is almost equal to 0 (T < ε for a small ε > 0).
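The eight steps and the geometric variant of the annealing strategy can be put together in a short sketch. The function and parameter names are hypothetical, a fixed number of iterations per temperature stands in for the equilibrium test, and, as an addition not present in the step list, the best state seen so far is tracked and returned. The toy problem (minimizing the number of ones in a bit string) is only a placeholder for a real energy function.

```python
import math
import random


def log_cooling(t0, n):
    """Logarithmic strategy T(t_n) = T0 / (1 + ln t_n), for n >= 1."""
    return t0 / (1.0 + math.log(n))


def simulated_annealing(energy, neighbor, s0, t0, alpha=0.9,
                        iters_per_temp=50, t_min=1e-3, rng=None):
    """SA with the geometric strategy T <- alpha * T (illustrative sketch)."""
    rng = rng or random.Random(0)
    s, t = s0, t0                               # step 1
    e = energy(s)
    best, best_e = s, e
    while t >= t_min:                           # step 2: stop when T is almost 0
        for _ in range(iters_per_temp):         # step 3: fixed-length "equilibrium"
            s_new = neighbor(s, rng)            # step 4
            delta = energy(s_new) - e           # step 5
            # Step 6: always accept improvements, sometimes accept worse states.
            if delta < 0 or rng.random() < math.exp(-delta / t):
                s, e = s_new, e + delta
                if e < best_e:
                    best, best_e = s, e
        t *= alpha                              # step 8
    return best, best_e


def flip_one_bit(state, rng):
    """Neighboring state: differs from the current one in a single position."""
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]


# Toy problem: the energy is the number of ones in a 12-bit string.
best_state, best_energy = simulated_annealing(
    energy=sum, neighbor=flip_one_bit, s0=(1,) * 12, t0=2.0, alpha=0.8)
```

At high temperature almost every move is accepted, while near the end only improving moves pass the test in step 6, so the run finishes as a plain descent. The helper `log_cooling` shows the other strategy from (9.50); it cools much more slowly and would need a different stop condition.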

The tabu search metaheuristic was proposed by Glover [13]. This algorithm models processes existing
in the human memory. This memory is implemented as a simple list of solutions explored recently. The
algorithm starts from a given solution, x, which is treated as the current best solution, x* ← x. The tabu
list is empty: T = ∅. Next, the set of neighboring solutions is generated, excluding solutions noted in
the tabu list, and the best solution of this set, x', is chosen as the new base point. If x' is better than x*, then
x* ← x'. The current base point x' is added to the tabu list. This process is iteratively repeated until a given
criterion is satisfied. There are many implementations of the tabu search idea, which differ from each
other in the method of managing the tabu list, for example, the tabu navigation method (TNM), the
cancellation sequence method (CSM), and the reverse elimination method (REM). A detailed description of
these methods can be found in [14].
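The basic loop can be sketched as follows. This is the simplest variant, with a fixed-length list of recent base points; it is not one of the TNM, CSM, or REM schemes, and the function name, the list length, and the toy cost table are hypothetical.

```python
from collections import deque


def tabu_search(cost, neighbors, x0, tabu_size=5, max_iter=100):
    """Basic tabu search with a fixed-length tabu list (illustrative sketch)."""
    x = best = x0
    tabu = deque(maxlen=tabu_size)       # recently visited base points
    for _ in range(max_iter):
        candidates = [y for y in neighbors(x) if y not in tabu]
        if not candidates:
            break                        # every neighbor is tabu
        x = min(candidates, key=cost)    # best admissible neighbor, even if worse
        if cost(x) < cost(best):
            best = x
        tabu.append(x)
    return best


# Toy problem: positions 0..10 with a local minimum at 1 and the global one at 7.
costs = [5, 3, 4, 6, 5, 2, 1, 0, 2, 4, 6]


def line_neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(costs)]


found = tabu_search(cost=lambda x: costs[x], neighbors=line_neighbors, x0=0)
```

A plain descent started at position 0 would stop in the local minimum at position 1. Here the tabu list forbids stepping back, so the search is pushed over the ridge at position 3 and reaches the global minimum.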

The SA algorithm was first applied to neural-network learning. In most cases, however, the evolutionary
approaches performed better than SA (e.g., [52,54]). In the case of ANN architecture optimization,
the SA and tabu search algorithms were applied in two ways. In the first one, the architecture is
represented by a binary string, in which each position shows whether or not a given synaptic connection 
exists. The energy function for SA is the generalization criterion (9.8). Such a defined SA algorithm 
has given better results than the GA, based on the same chromosome representation [41]. A similar 
representation, enlarged by an additional real vector of network free parameters, was implemented for 
SA and the tabu search in [28] for simultaneous weight and architecture adjusting. Both algorithms are 
characterized by slow convergence to optimal solution. Thus, Lauret et al. [28] propose a very interesting 
hybrid algorithm that exploits the ideas of both algorithms. In this approach, a set of new solutions is 
generated, and the best one is selected according to the cost function, as performed by the tabu search. 
But the best solution is not always accepted since the decision is guided by the Boltzmann probability 
distribution as it is done in SA. Such a methodology performed better than the standard SA and tabu 
search algorithms. The other idea is the cascade reduction [39]. One starts with a network structure that
is supposed to be sufficiently complex and reduces it using a given algorithm. Thus, we obtain a network
with $\eta^*(0)$ parameters from $\eta_{max}(0)$. In the next step, we assume that $\eta_{max}(1) = \eta^*(0)$ and apply reduction
again. This process is repeated until $\eta^*(k) = \eta_{max}(k)$ $(= \eta^*(k - 1))$.
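The cascade reduction loop and the compression factor can be sketched as follows. The reduction algorithm itself is represented only by a hypothetical callable that maps the current number of parameters to the number left after one reduction pass; in a real application this callable would run SA, tabu search, or a GA on the network.

```python
def compression_factor(n_star, n_max):
    """Compression factor in percent: connections kept relative to those allowed."""
    return 100.0 * n_star / n_max


def cascade_reduction(n_max, reduce_once):
    """Apply reduce_once repeatedly, feeding each result in as the next upper
    bound, until a pass removes nothing. Returns the list of sizes."""
    history = [n_max]
    while True:
        n_star = reduce_once(history[-1])
        if n_star == history[-1]:
            return history
        history.append(n_star)


# Toy stand-in for a reduction pass: drops a quarter of the parameters but
# never goes below 10 (a purely illustrative rule).
sizes = cascade_reduction(40, lambda n: max(10, n - n // 4))
final_factor = compression_factor(sizes[-1], sizes[0])
```

The loop stops at the first pass that leaves the size unchanged, which corresponds to the stopping condition stated in the text.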

In the second application of the SA and tabu search algorithms to ANN architecture optimization, both
algorithms search the graph of network architectures that was used for the A* algorithm (Figure 9.7)
[41-43]. The neighboring solutions of a given architecture are all of its predecessors and successors.


9.4 Summary 


The ANN architecture optimization problem is one of the most basic and important subtasks of neural
application design. Both insufficiency and redundancy of network processing units lead to an unsatisfactory
quality of the ANN model. Although the set of solutions proposed in the literature is very rich,
especially in the case of feed-forward ANNs, there is still no procedure that is fully satisfactory for
researchers. Two types of methods have been exploited in recent years: methods based on information
analysis and discrete optimization algorithms. The methods of the first class are mathematically well
grounded, but they are usually dedicated to simple networks (like regression networks), and their
applicability is limited. Moreover, because most of these methods are grounded on the statistical analysis
approach, rich sets of learning patterns are needed. The methods of discrete optimization seem to be
most attractive for ANN structure design, especially in the case of dynamic neural networks, which still
await efficient architecture optimization methods.
10.1 Introduction


Parity-N problems have been studied extensively in the literature [WT93,AW95,HLS99,WH10]. The N-bit
parity function can be interpreted as a mapping (defined by $2^N$ binary vectors) that indicates whether the
sum of the N elements of every binary vector is odd or even. It has been shown that threshold networks with
one hidden layer require N hidden threshold units to solve the parity-N problem [M61,HKP91,WH03].
If the network has bridged connections across layers, then the number of hidden threshold units can
be reduced by half. In this case, only N/2 neurons are required in the hidden layer for the parity-N
problem [M61]. Later, Paturi and Saks [PS90] showed that only $N/\log_2 N$ neurons are required. Siu,
Roychowdhury, and Kailath [SRT91] showed that when one more hidden layer is introduced, the total
number of hidden units could be only $2\sqrt{N}$.

In this chapter, the parity-N problem is solved by different networks, so as to compare the efficiency 
of neural architecture. 

One may notice that, in parity problems, the same value of the sum of all inputs results in the same
output. Therefore, by setting all the weights on the network inputs to 1, the number of training patterns
of the parity-N problem can be reduced from $2^N$ to N + 1.

Figure 10.1 shows both the original eight training patterns and the simplified four training patterns,
which are equivalent.

Based on this pattern simplification, a linear neuron (with slope equal to 1) can be used as the network 
input (see Figure 10.2b). This linear neuron works as a summator. It does not have bias input and does 
not need to be trained. 
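The pattern simplification is easy to verify by enumeration. The sketch below, with hypothetical function names, generates all $2^N$ patterns, groups them by the sum of their inputs, and confirms that each sum corresponds to exactly one target value.

```python
from itertools import product


def parity_patterns(n):
    """All 2^n input vectors with their parity (1 if the number of ones is odd)."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]


def simplified_patterns(n):
    """Collapse the patterns by the sum of the inputs (unit input weights)."""
    table = {}
    for bits, target in parity_patterns(n):
        s = sum(bits)
        if table.setdefault(s, target) != target:
            raise ValueError("one input sum maps to two different targets")
        # otherwise the pattern repeats an entry that is already present
    return sorted(table.items())


original = parity_patterns(3)        # eight patterns for the parity-3 problem
simplified = simplified_patterns(3)  # four (sum, parity) pairs
```

For the parity-3 problem this reproduces the reduction from eight patterns to four, and the same call with a larger argument gives N + 1 pairs.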
FIGURE 10.1 Training simplification for the parity-3 problem: (a) original patterns and (b) simplified patterns.

FIGURE 10.2 Two equivalent networks for the parity-3 problem: (a) parity-3 inputs and (b) linear neuron inputs.


10.2 MLP Networks with One Hidden Layer 


Multilayer perceptron (MLP) networks are the most popular networks, because they have a regular
structure and are easy to program. In MLP networks, neurons are organized layer by layer and there
are no connections across layers.

Both parity-2 (XOR) and parity-3 problems can be visually illustrated in two and three dimensions 
respectively, as shown in Figure 10.3. 

Similarly, when an MLP network with one hidden layer is used to solve the parity-7 problem, at least
seven neurons are needed in the hidden layer to separate the eight training patterns (using the simplification
described in the introduction), as shown in Figure 10.4a.

In Figure 10.4a, eight patterns {0, 1, 2, 3, 4, 5, 6, 7} are separated by seven neurons (bold lines). The
thresholds of the hidden neurons are {0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5}. Then, summing the outputs of the
hidden neurons weighted by {1, -1, 1, -1, 1, -1, 1}, the net input at the output neuron can only be 0 or 1,
which can be separated by a neuron with threshold 0.5. Therefore, the parity-7 problem can be solved by
the architecture shown in Figure 10.4b.

Generally, if there are n neurons in an MLP network with a single hidden layer, the largest possible
parity-N problem that can be solved is


N = n − 1 (10.1)


where
n is the number of neurons (the hidden neurons and the output neuron)
N is the parity index
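The construction can be checked directly. The sketch below uses hard-threshold (unipolar step) neurons, the thresholds 0.5, 1.5, and so on, and the alternating output weights described above; the function names are hypothetical.

```python
from itertools import product


def step(x, threshold):
    """Unipolar hard-threshold neuron: fires when its net input exceeds the threshold."""
    return 1 if x > threshold else 0


def mlp_parity(bits):
    """Parity network with one hidden layer of len(bits) threshold neurons."""
    n = len(bits)
    s = sum(bits)                                    # all input weights equal to 1
    hidden = [step(s, k + 0.5) for k in range(n)]    # thresholds 0.5, 1.5, ...
    weights = [1 if k % 2 == 0 else -1 for k in range(n)]
    net = sum(w * h for w, h in zip(weights, hidden))
    return step(net, 0.5)                            # output neuron


# Exhaustive check over all 2^7 input patterns of the parity-7 problem.
all_correct = all(mlp_parity(bits) == sum(bits) % 2
                  for bits in product((0, 1), repeat=7))
```

With seven hidden neurons and one output neuron the network has n = 8 neurons, in agreement with N = n − 1 = 7.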
FIGURE 10.3 Graphical interpretation of pattern separation by hidden layer and network implementation using
unipolar neurons for (a) XOR problem and (b) parity-3 problem.

FIGURE 10.4 Solving the parity-7 problem using MLP network with one hidden layer: (a) analysis and
(b) architecture.


10.3 BMLP Networks


In MLP networks, if connections across layers are permitted, then networks have bridged multilayer 
perceptron (BMLP) topologies. BMLP networks are more powerful than traditional MLP networks. 


10.3.1 BMLP Networks with One Hidden Layer 


Considering BMLP networks with only one hidden layer, all network inputs are also connected to the 
output neuron or neurons. 

For the parity-7 problem, the eight simplified training patterns can be separated by three neurons into
four subpatterns {0, 1}, {2, 3}, {4, 5}, and {6, 7}. The thresholds of the hidden neurons should be {1.5, 3.5,
5.5}. In order to transfer all subpatterns to the unique pattern {0, 1} for separation, patterns {2, 3}, {4, 5}, and
{6, 7} should be reduced by 2, 4, and 6, respectively, which determines the weight values on the connections
between the hidden neurons and the output neuron. After pattern transformation, the unique pattern {0, 1} can
be separated by the output neuron with threshold 0.5. The design process is shown in Figure 10.5a and
the corresponding solution architecture is shown in Figure 10.5b.

For the parity-11 problem, a similar analysis and the related solution architecture of a BMLP network
with a single hidden layer are presented in Figure 10.6.

Generally, for n neurons in BMLP networks with one hidden layer, the largest parity-N problem that
can possibly be solved is


N = 2n − 1 (10.2)
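The bridged design can be verified in the same way. In this sketch each hidden neuron contributes a weight of −2 to the output neuron, so that the reductions by 2, 4, and 6 accumulate as more hidden neurons fire, and the output neuron also receives the inputs directly. The function names are hypothetical.

```python
from itertools import product


def step(x, threshold):
    """Unipolar hard-threshold neuron."""
    return 1 if x > threshold else 0


def bmlp_parity(bits):
    """BMLP parity network with one hidden layer and bridged input connections."""
    n = len(bits)
    s = sum(bits)                                   # unit weights on all inputs
    thresholds = [t + 0.5 for t in range(1, n, 2)]  # 1.5, 3.5, 5.5, ...
    hidden = [step(s, t) for t in thresholds]
    net = s - 2 * sum(hidden)                       # direct inputs plus hidden outputs
    return step(net, 0.5)


parity7_ok = all(bmlp_parity(b) == sum(b) % 2 for b in product((0, 1), repeat=7))
parity11_ok = all(bmlp_parity(b) == sum(b) % 2 for b in product((0, 1), repeat=11))
```

For parity-7 the network uses three hidden neurons and one output neuron (n = 4), and for parity-11 five hidden neurons and one output neuron (n = 6), which agrees with N = 2n − 1 in both cases.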


10.3.2 BMLP Networks with Multiple Hidden Layers 


If BMLP networks have more than one hidden layer, then a further reduction of the number of
neurons is possible for solving the same problem.

For the parity-11 problem, solutions can be found using four neurons, in both the 11=2=1=1 and
11=1=2=1 architectures.

FIGURE 10.5 Solving the parity-7 problem using BMLP networks with one hidden layer: (a) analysis and
(b) architecture.

FIGURE 10.6 Solving the parity-11 problem using BMLP networks with single hidden layer: (a) analysis and
(b) architecture.

Considering the 11=2=1=1 network, the 12 simplified training patterns are first separated by two
neurons into {0, 1, 2, 3}, {4, 5, 6, 7}, and {8, 9, 10, 11}; the thresholds of the two neurons are 3.5 and 7.5,
respectively. Then, subpatterns {4, 5, 6, 7} and {8, 9, 10, 11} are transformed to {0, 1, 2, 3} by subtracting 4
and 8, respectively, which determines the weight values on the connections between the first hidden layer and
the following layers. In the second hidden layer, one neuron is introduced to separate {0, 1, 2, 3} into {0, 1} and
{2, 3}, with threshold 1.5. After that, subpattern {2, 3} is transferred to {0, 1} by setting the weight value to −2
on the connection between the second layer and the output layer. Finally, the output neuron with threshold 0.5
separates the pattern {0, 1}. The whole procedure is presented in Figure 10.7.

Figure 10.8 shows the 11 = 1 = 2 = 1 BMLP network with two hidden layers, for solving the parity-11 
problem. 

Generally, considering the BMLP network with two hidden layers, the largest parity-N problem that can
possibly be solved is


N = 2(m + 1)(n + 1) − 1 (10.3)


where m and n are the numbers of neurons in the two hidden layers, respectively.
For further derivation, one may notice that if there are k hidden layers and $n_i$ is the number of neurons
in the ith hidden layer, where i ranges from 1 to k, then


$N = 2(n_1 + 1)(n_2 + 1) \cdots (n_{k-1} + 1)(n_k + 1) - 1$ (10.4)
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FIGURE 10.7 Solving the parity-11 problem using BMLP networks with two hidden layers, 11=2=1=1: (a) analysis 
and (b) architecture. 


© 2011 by Taylor and Francis Group, LLC 


FIGURE 10.8 Solving the parity-11 problem using BMLP networks with two hidden layers, 11=1=2=1: (a) analysis
(pattern sums 0 to 11) and (b) architecture (Inputs 1 to 11; weights = (-5.5, -1.5, -3.5)).


10.4 FCC Networks 


Fully connected cascade (FCC) networks can solve problems using the smallest possible number of
neurons [W09]. In FCC networks, all possible connections are weighted, and each neuron forms a layer of its own.

For the parity-7 problem, the eight simplified training patterns are first divided by one neuron
into {0, 1, 2, 3} and {4, 5, 6, 7}; the threshold of this neuron is 3.5. Then the subpattern {4, 5, 6, 7} is
transferred to {0, 1, 2, 3} by weights equal to -4 on the connections to the following neurons. Next,
another neuron separates the patterns {0, 1, 2, 3} in the second hidden layer into {0, 1} and
{2, 3}; the threshold of this neuron is 1.5. In order to transfer the subpattern {2, 3} to {0, 1}, 2 should
be subtracted from subpattern {2, 3}, which determines that the weight between the second layer and
the output layer is -2. Finally, an output neuron with threshold 0.5 is used to separate the pattern {0, 1};
see Figure 10.9.
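
The parity-7 procedure above can be checked with a short script. This is only a sketch under stated assumptions: all input weights are taken as +1, neurons use a hard threshold, and the names `step` and `fcc_parity7` are illustrative rather than taken from the chapter.

```python
from itertools import product


def step(net):
    """Hard threshold: 1 if net is positive, else 0."""
    return 1 if net > 0 else 0


def fcc_parity7(bits):
    """Three-neuron FCC network for parity-7 (all input weights assumed +1)."""
    s = sum(bits)                            # simplified pattern: number of ones, 0..7
    h1 = step(s - 3.5)                       # first neuron, threshold 3.5
    h2 = step(s - 4 * h1 - 1.5)              # second neuron, weight -4 from h1, threshold 1.5
    return step(s - 4 * h1 - 2 * h2 - 0.5)   # output neuron, weights -4 and -2, threshold 0.5


# Compare the network output with the true parity for all 2**7 input patterns.
errors = sum(fcc_parity7(b) != sum(b) % 2 for b in product((0, 1), repeat=7))
print("misclassified patterns:", errors)
```

Every one of the 128 binary patterns is reduced to its sum, so only the eight simplified patterns 0 to 7 matter, as in the analysis above.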

Figure 10.10 shows the solution of parity-15 problem using FCC networks. 

Considering the FCC networks as special BMLP networks with only one neuron in each hidden layer,
for n neurons in FCC networks, the largest N for the parity-N problem can be derived from Equation 10.4 as


$$N = 2\underbrace{(1 + 1)(1 + 1) \cdots (1 + 1)(1 + 1)}_{n-1} - 1 \tag{10.5}$$

or

$$N = 2^n - 1 \tag{10.6}$$

FIGURE 10.9 Solving the parity-7 problem using FCC networks: (a) analysis and (b) architecture.

FIGURE 10.10 Solving the parity-15 problem using FCC networks: (a) analysis and (b) architecture.


TABLE 10.1 Different Architectures for Solving the Parity-N Problem

Network Structure                   Parameters                                Parity-N Problem
MLP with single hidden layer        n neurons                                 n - 1
BMLP with one hidden layer          n neurons                                 2n - 1
BMLP with multiple hidden layers    h hidden layers, each with n_i neurons    2(n_1 + 1)(n_2 + 1)...(n_{h-1} + 1)(n_h + 1) - 1
FCC                                 n neurons                                 2^n - 1


FIGURE 10.11 Efficiency comparison among various neural network architectures (value of N in the parity-N
problem versus number of neurons, for MLP with one hidden layer, BMLP with one, two, and three hidden
layers, and FCC).
10.5 Comparison of Topologies


Table 10.1 summarizes the analysis above, giving the largest parity-N problem that can be solved with a given
network structure.
Figure 10.11 compares the efficiency of various neural network architectures.
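
The limits collected in this comparison can be evaluated directly. The sketch below simply encodes Equations 10.2, 10.4, and 10.6 and the MLP limit of n - 1; the function names are illustrative, and n is assumed to count all neurons of the network, as in Equation 10.6.

```python
from math import prod


def mlp_max_parity(n):
    """MLP with a single hidden layer and n neurons in total."""
    return n - 1


def bmlp_one_hidden_max_parity(n):
    """BMLP with one hidden layer and n neurons in total (Equation 10.2)."""
    return 2 * n - 1


def bmlp_max_parity(hidden_sizes):
    """BMLP with the given number of neurons in each hidden layer (Equation 10.4)."""
    return 2 * prod(h + 1 for h in hidden_sizes) - 1


def fcc_max_parity(n):
    """FCC network with n neurons in total (Equation 10.6)."""
    return 2 ** n - 1


for n in range(3, 6):
    print(n, mlp_max_parity(n), bmlp_one_hidden_max_parity(n), fcc_max_parity(n))
```

For instance, `bmlp_max_parity([2, 1])` and `bmlp_max_parity([1, 2])` both reproduce the parity-11 result for the four-neuron 11=2=1=1 and 11=1=2=1 networks.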


10.6 Conclusion 


This chapter analyzed the efficiency of different network architectures, using parity-N problems. Based
on the comparison in Table 10.1 and Figure 10.11, one may notice that, for the same number of neurons,
FCC networks are able to solve larger parity-N problems than the other
architectures.

However, FCC networks also have the largest number of layers, and this makes them very difficult to
train. For example, few algorithms are powerful enough to train the parity-N problem with the given
optimal architectures, such as four neurons for the parity-15 problem. So, the reasonable architecture
would be the BMLP network with a couple of hidden layers.
11.1 Introduction


The concept of systems that can learn was well described over half a century ago by Nilsson [N65] in his book
Learning Machines, where he summarized many developments of that time. The publication of the Minsky
and Papert [MP69] book slowed down artificial neural network research, and the mathematical foundation
of the back-propagation algorithm by Werbos [W74] went unnoticed. A decade later, Rumelhart et al.
[RHW86] showed that the error back-propagation (EBP) algorithm effectively trained neural networks
[WT93,WK00,W07,FAEC02,FANOI1]. Since that time many learning algorithms have been developed, but
only a few of them can efficiently train multilayer neural networks. Even the best learning algorithms
currently known have difficulty training neural networks with a reduced number of neurons.

Similar to biological neurons, the weights in artificial neurons are adjusted during a training
procedure. Some learning rules use only local signals in the neurons, while others require information from
the outputs; some require a supervisor who knows what the outputs should be for the given patterns, while
unsupervised algorithms need no such information. Common learning rules are described in the following sections.


11.2 Foundations of Neural Network Learning 


Neural networks can be trained efficiently only if they are transparent, so that small changes in the weights'
values produce changes in the neural outputs. This is not possible if neurons have hard-activation functions.
Therefore, it is essential that all neurons have soft activation functions (Figure 11.1).


FIGURE 11.1 Neurons in the trainable network must have soft activation functions.


Neurons respond strongly to input patterns if the weights' values are similar to the incoming signals. Let us
analyze the neuron shown in Figure 11.2 with five inputs, and let us assume that the input signals are binary
and bipolar (-1 or +1). For example, if the inputs are X = [1, -1, 1, -1, -1] and the weights are
W = [1, -1, 1, -1, -1], then the net value is


$$net = \sum_{i=1}^{5} x_i w_i = XW^T = 5 \tag{11.1}$$


This is the maximal net value, because for any other input signals the net value will be smaller. For example,
if the input vector differs from the weight vector by one bit (that is, the Hamming distance HD = 1), then
net = 3. Therefore,


$$net = \sum_{i=1}^{n} x_i w_i = XW^T = n - 2HD \tag{11.2}$$


where 
n is the size of the input 
HD is the Hamming distance between input pattern X and the weight vector W 
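
Equation 11.2 can be verified exhaustively for the five-input example. The following is a small sketch; the helper names `net_value` and `hamming_distance` are illustrative and not part of the chapter.

```python
from itertools import product


def net_value(x, w):
    """Weighted sum of the inputs, as in Equation 11.1."""
    return sum(xi * wi for xi, wi in zip(x, w))


def hamming_distance(x, w):
    """Number of positions in which two bipolar vectors differ."""
    return sum(xi != wi for xi, wi in zip(x, w))


w = [1, -1, 1, -1, -1]
n = len(w)

# Check net = n - 2*HD for every one of the 2**5 bipolar input patterns.
holds = all(net_value(x, w) == n - 2 * hamming_distance(x, w)
            for x in product((-1, 1), repeat=n))
print(net_value(w, w), net_value([1, -1, 1, -1, 1], w), holds)
```

The first two printed values correspond to the identical pattern (net = 5) and to a pattern that differs in one bit (net = 3).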


This is true for binary bipolar values, but this concept can be extended to weights and patterns with
analog values, as long as the lengths of the weight vector and the input pattern vectors are the same.
Therefore, the weights' changes should be proportional to the input pattern


$$\Delta W \sim X \tag{11.3}$$


In other words, the neuron receives maximum excitation if the input pattern and the weight vector are equal.
The learning process should continue as long as the network produces wrong answers. Learning may
stop if there are no errors on the network outputs. This implies the rule that the weight change should be
proportional to the error. This rule is used in the popular EBP algorithm. Unfortunately, when errors
become smaller, the weight corrections also become smaller, and the training process converges
asymptotically and very slowly. Therefore, this rule is not used in advanced fast-learning algorithms such as
LM [HM94] or NBN [WCKD08,W09,WY10].


FIGURE 11.2 Neuron as the Hamming distance classifier.


11.3 Learning Rules for Single Neuron 
11.3.1 Hebbian Learning Rule 


The Hebb [H49] learning rule is based on the assumption that if two neighboring neurons are activated
and deactivated at the same time, then the weight connecting these neurons should increase. For
neurons operating in the opposite phase, the weight between them should decrease. If there is no signal
correlation, the weight should remain unchanged. This assumption can be described by the formula


$$\Delta w_{ij} = c\,x_i\,o_j \tag{11.4}$$


where
$w_{ij}$ is the weight from the ith to the jth neuron
c is the learning constant
$x_i$ is the signal on the ith input
$o_j$ is the output signal


The training process usually starts with the values of all weights set to zero. This learning rule can be used
for both soft- and hard-activation functions. Since the desired responses of neurons are not used in the
learning procedure, this is an unsupervised learning rule. The absolute values of the weights are usually
proportional to the learning time, which is undesirable.
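
The growth of the weights with learning time can be seen in a minimal sketch of the Hebbian rule. The unipolar sigmoidal activation, the single repeated input pattern, and the name `hebbian_train` are assumptions made only for this illustration.

```python
import math


def hebbian_train(x, c=0.1, steps=100):
    """Repeatedly apply dw_i = c * x_i * o for one pattern x, starting from zero weights."""
    w = [0.0] * len(x)
    for _ in range(steps):
        net = sum(wi * xi for wi, xi in zip(w, x))
        o = 1.0 / (1.0 + math.exp(-net))     # assumed unipolar soft activation
        w = [wi + c * xi * o for wi, xi in zip(w, x)]
    return w


x = [1.0, -1.0, 1.0]
w100 = hebbian_train(x, steps=100)
w200 = hebbian_train(x, steps=200)
print(w100, w200)
```

Because the output stays positive, every step pushes the weights further in the direction of the input pattern, so doubling the number of steps gives noticeably larger weights.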


11.3.2 Correlation Learning Rule 


The correlation learning rule is based on a principle similar to that of the Hebbian learning rule. It assumes that
weights between simultaneously responding neurons should be largely positive, and weights between
neurons with opposite reactions should be largely negative.

Contrary to the Hebbian rule, the correlation rule is a supervised learning rule. Instead of the actual
response, $o_j$, the desired response, $d_j$, is used for the weight-change calculation


$$\Delta w_{ij} = c\,x_i\,d_j \tag{11.5}$$


where $d_j$ is the desired value of the output signal. This training algorithm usually starts with initialization
of the weights to zero.


11.3.3 Instar Learning Rule 


If input vectors and weights are normalized, or if they have only binary bipolar values (-1 or +1), then 
the net value will have the largest positive value when the weights and the input signals are the same. 
Therefore, weights should be changed only if they are different from the signals 


$$\Delta w_i = c\,(x_i - w_i) \tag{11.6}$$


Note that the information required for the weight is taken only from the input signals. This is a very local 
and unsupervised learning algorithm [G69]. 
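
A short sketch shows how repeated application of Equation 11.6 moves the weights toward the input pattern. The learning constant, the number of steps, and the name `instar_train` are illustrative assumptions.

```python
def instar_train(x, c=0.5, steps=10):
    """Apply dw_i = c * (x_i - w_i) repeatedly, starting from zero weights."""
    w = [0.0] * len(x)
    for _ in range(steps):
        w = [wi + c * (xi - wi) for wi, xi in zip(w, x)]
    return w


x = [1.0, -1.0, 1.0, -1.0, -1.0]
w = instar_train(x)
print(w)
```

Each step removes the fraction c of the remaining difference, so after k steps the difference is (1 - c)^k of its initial value and the weights approach the input signals.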
11.3.4 Winner Takes All


The winner takes all (WTA) rule is a modification of the instar algorithm, where weights are modified only
for the neuron with the highest net value. Weights of the remaining neurons are left unchanged. Sometimes
this algorithm is modified in such a way that the weights of a few neurons with the highest net values are
modified at the same time. Although this is an unsupervised algorithm, because we do not know what the
desired outputs are, there is a need for a "judge" or "supervisor" to find the winner with the largest net value.
The WTA algorithm, developed by Kohonen [K88], is often used for automatic clustering and for extracting
statistical properties of input data.
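
The clustering behavior of the WTA rule can be sketched with two neurons and four unit-length patterns. The data, the initial weights, and the names `winner` and `wta_train` are hypothetical and chosen only to keep the example small.

```python
def winner(weights, x):
    """Index of the neuron with the highest net value for pattern x."""
    nets = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return nets.index(max(nets))


def wta_train(weights, patterns, c=0.3, epochs=20):
    """Instar update (Equation 11.6) applied only to the winning neuron."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in patterns:
            j = winner(weights, x)
            weights[j] = [wi + c * (xi - wi) for wi, xi in zip(weights[j], x)]
    return weights


# Two groups of unit-length patterns: near (1, 0) and near (0, 1).
patterns = [(1.0, 0.0), (0.0, 1.0), (0.96, 0.28), (0.28, 0.96)]
trained = wta_train([(0.8, 0.6), (0.6, 0.8)], patterns)
clusters = [winner(trained, x) for x in patterns]
print(clusters)
```

After training, each neuron responds most strongly to one group of patterns, which is the automatic clustering mentioned above.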


11.3.5 Outstar Learning Rule 


In the outstar learning rule, it is required that the weights connected to a certain node be equal to
the desired outputs for the neurons connected through those weights


$$\Delta w_{ij} = c\,(d_j - w_{ij}) \tag{11.7}$$


where
$d_j$ is the desired neuron output
c is the small learning constant, which further decreases during the learning procedure


This is the supervised training procedure, because desired outputs must be known. Both instar and 
outstar learning rules were proposed by Grossberg [G69]. 


11.3.6 Widrow-Hoff LMS Learning Rule


Widrow and Hoff [WH60] developed a supervised training algorithm that allows training a neuron for
the desired response. This rule was derived so that the square of the difference between the net value and the
desired output value is minimized. The error for the jth neuron is


$$Error_j = \sum_{p=1}^{P} \left(net_{jp} - d_{jp}\right)^2 \tag{11.8}$$


where
P is the number of applied patterns
$d_{jp}$ is the desired output for the jth neuron when the pth pattern is applied
net is given by


$$net = \sum_{i=1}^{n} w_i x_i \tag{11.9}$$


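This rule is also known as the least mean square (LMS) rule. By calculating the derivative of Equation 11.8
with respect to $w_{ij}$ to find the gradient, the formula for the weight change can be found:


$$\frac{d\,Error_j}{d w_{ij}} = 2\sum_{p=1}^{P} x_{ip}\left(net_{jp} - d_{jp}\right) \tag{11.10}$$


where $x_{ip}$ is the signal on the ith input for the pth pattern. Moving against the gradient, and absorbing the
factor 2 into the learning constant c,


$$\Delta w_{ij} = c\sum_{p=1}^{P} x_{ip}\left(d_{jp} - net_{jp}\right) \tag{11.11}$$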


Note that the weight change, $\Delta w_{ij}$, is a sum of the changes from each of the individually applied patterns.
Therefore, it is possible to correct the weight after each individual pattern is applied. This process is
known as incremental updating. Cumulative updating is when the weights are changed after all patterns
have been applied once. Incremental updating usually leads to a solution faster, but it is sensitive to the
order in which patterns are applied. If the learning constant c is chosen to be small, then both methods
give the same result. The LMS rule works well for all types of activation functions. This rule tries to
force the net value to be equal to the desired value. Sometimes, this is not what the observer is looking for.
It is usually not important what the net value is, but it is important whether the net value is positive or negative.
For example, a very large net value with a proper sign will result in a correct output and in a large error
as defined by Equation 11.8, and this may be the preferred solution.
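
The difference between incremental and cumulative updating can be sketched for a single linear neuron. The four patterns (with the last input fixed at +1 to act as the threshold), the target values, and the function names are assumptions made for this illustration.

```python
def net_value(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def lms_incremental(patterns, desired, c=0.1, epochs=2000):
    """Correct the weights after each individual pattern."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, d in zip(patterns, desired):
            err = d - net_value(w, x)
            w = [wi + c * xi * err for wi, xi in zip(w, x)]
    return w


def lms_cumulative(patterns, desired, c=0.1, epochs=2000):
    """Sum the corrections over all patterns, then change the weights once."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        dw = [0.0] * len(w)
        for x, d in zip(patterns, desired):
            err = d - net_value(w, x)
            dw = [dwi + c * xi * err for dwi, xi in zip(dw, x)]
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w


patterns = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
desired = [2 * x1 - x2 + 0.5 for x1, x2, _ in patterns]   # hypothetical linear target
w_inc = lms_incremental(patterns, desired)
w_cum = lms_cumulative(patterns, desired)
print(w_inc, w_cum)
```

With this small learning constant both variants approach the same weights, close to (2, -1, 0.5), which is the behavior described above.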


11.3.7 Linear Regression 


The LMS learning rule requires hundreds of iterations, using formula (11.11), before it converges to
the proper solution. If linear regression is used, the same result can be obtained in only one step
[W02,AW95]. Considering one neuron and using vector notation for a set of the input patterns X applied
through the weight vector w, the vector of net values net is calculated using


$$Xw^T = net \tag{11.12}$$


where
X is the rectangular array (n + 1) × p of input patterns
n is the number of inputs
p is the number of patterns


Note that the size of the input patterns is always augmented by one, and this additional weight is
responsible for the threshold (see Figure 11.3).

This method, similar to the LMS rule, assumes a linear activation function, and so the net values
should be equal to the desired output values d

$$Xw^T = d \tag{11.13}$$

Usually, p > n + 1, and the preceding equation can be solved only in the least mean square error sense.
Using the vector arithmetic, the solution is given by

$$w = \left(X^T X\right)^{-1} X^T d \tag{11.14}$$

The linear regression, which is an equivalent of the LMS algorithm, works correctly only for linear
activation functions. For typical sigmoidal activation functions, this learning rule usually produces a
wrong answer. However, when it is used iteratively by computing $\Delta W$ instead of W, correct results can be
obtained (see Figure 11.4).

FIGURE 11.3 Single neuron with the threshold adjusted by additional weight $w_{n+1}$; the net value is
$net = \sum_{i=1}^{n} w_i x_i + w_{n+1}$.

FIGURE 11.4 Single neuron training to separate patterns using LMS rule and using iterative regression with
sigmoidal activation function.
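
The one-step solution of Equation 11.14 can be sketched without any numerical library. In this sketch the patterns are stored one per row, already augmented with a +1 input for the threshold; the small Gauss-Jordan solver and all names and data are assumptions introduced for the example.

```python
def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def linear_regression(patterns, desired):
    """Weights from the normal equations (X^T X) w = X^T d."""
    cols = range(len(patterns[0]))
    xtx = [[sum(x[i] * x[j] for x in patterns) for j in cols] for i in cols]
    xtd = [sum(x[i] * d for x, d in zip(patterns, desired)) for i in cols]
    return solve(xtx, xtd)


patterns = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]   # last input is the +1 threshold input
desired = [0.5, -0.5, 2.5, 1.5]                           # hypothetical targets: 2*x1 - x2 + 0.5
w = linear_regression(patterns, desired)
print(w)
```

Here p = 4 patterns and n + 1 = 3 weights, so the system is overdetermined and the weights are obtained in a single step rather than through many LMS iterations.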


11.3.8 Delta Learning Rule 


The LMS method assumes a linear activation function, o = net, and the obtained solution is sometimes far
from optimum, as shown in Figure 11.4 for a simple two-dimensional case with four patterns belonging
to two categories. In the solution obtained using the LMS algorithm, one pattern is misclassified. The most
common activation functions and their derivatives are, for bipolar neurons,


$$o = f(net) = \tanh\left(\frac{k\,net}{2}\right), \qquad f'(o) = 0.5k\left(1 - o^2\right) \tag{11.15}$$


and for unipolar neurons


$$o = f(net) = \frac{1}{1 + \exp(-k\,net)}, \qquad f'(o) = k\,o(1 - o) \tag{11.16}$$


where k controls the slope of the activation function at net = 0. If the error is defined as


$$Error_j = \sum_{p=1}^{P} \left(o_{jp} - d_{jp}\right)^2 \tag{11.17}$$


then the derivative of the error with respect to the weight $w_{ij}$ is


$$\frac{d\,Error_j}{d w_{ij}} = 2\sum_{p=1}^{P} \left(o_{jp} - d_{jp}\right) \frac{d f(net_{jp})}{d\,net_{jp}}\, x_{ip} \tag{11.18}$$


where o = f(net) is given by (11.15) or (11.16) and net is given by (11.9). Note that this derivative is
proportional to the derivative of the activation function f'(net).
In the case of incremental training, for each applied pattern

$$\Delta w_{ij} = c\,x_i\,f'_j\,(d_j - o_j) = c\,x_i\,\delta_j \tag{11.19}$$


Using the cumulative approach, the neuron weight, $w_{ij}$, should be changed after all patterns are applied:


$$\Delta w_{ij} = c\sum_{p=1}^{P} x_{ip}\left(d_{jp} - o_{jp}\right) f'_{jp} = c\sum_{p=1}^{P} x_{ip}\,\delta_{jp} \tag{11.20}$$


The weight change is proportional to the input signal $x_i$, to the difference between the desired and actual
outputs $d_{jp} - o_{jp}$, and to the derivative of the activation function $f'_{jp}$. Similar to the LMS rule, weights
can be updated in both ways: incrementally and cumulatively. One-layer neural networks are
relatively easy to train [AW95,WJ96,WCM99]. In comparison to the LMS rule, the delta rule always
leads to a solution close to the optimum. When the delta rule is used, all patterns in Figure
11.4 are classified correctly.
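
The incremental delta rule of Equation 11.19 can be sketched for one bipolar neuron. The four bipolar patterns (a logical AND with a +1 threshold input), the learning constant, and the name `train_delta` are assumptions used only for illustration; they are not the patterns of Figure 11.4.

```python
import math


def train_delta(patterns, desired, c=0.5, k=1.0, epochs=500):
    """Incremental delta rule for one bipolar neuron, o = tanh(k*net/2)."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, d in zip(patterns, desired):
            net = sum(wi * xi for wi, xi in zip(w, x))
            o = math.tanh(0.5 * k * net)           # activation from Equation 11.15
            fprime = 0.5 * k * (1.0 - o * o)       # its derivative
            delta = fprime * (d - o)
            w = [wi + c * xi * delta for wi, xi in zip(w, x)]
    return w


patterns = [(-1, -1, 1), (-1, 1, 1), (1, -1, 1), (1, 1, 1)]   # last input is the +1 threshold input
desired = [-1, -1, -1, 1]
w = train_delta(patterns, desired)
outputs = [math.tanh(0.5 * sum(wi * xi for wi, xi in zip(w, x))) for x in patterns]
print(w, outputs)
```

The sign of each trained output matches the desired category, even though the soft activation never reaches exactly -1 or +1.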


11.4 Training of Multilayer Networks 


Multilayer neural networks are more difficult to train. The most commonly used feed-forward
neural network is the multilayer perceptron (MLP) shown in Figure 11.5. Training is difficult because
signals propagate through several nonlinear elements (neurons) and there are many signal paths. The first
algorithm for multilayer training was the error back-propagation algorithm [W74,RHW86,BUD09,WJK99,
FFJC09,FFN01], and it is still often used because of its simplicity, even though the training process is
very slow and training of close-to-optimal networks seldom produces satisfying results.


11.4.1 Error Back-Propagation Learning 


The delta learning rule can be generalized for multilayer networks [W74,RHW86]. Using a similar 
approach, the gradient of the global error can be computed with respect to each weight in the network, 
as was described for the delta rule. The difference is that on top of a nonlinear activation function of a 
neuron, there is another nonlinear term F{z} as shown in Figure 11.6. The learning rule for EBP can be 
derived in a similar way as for the delta learning rule: 


$$o_p = F\{f(w_1 x_{p1} + w_2 x_{p2} + \cdots + w_n x_{pn})\} \tag{11.21}$$

$$TE = \sum_{p=1}^{np} \left[d_p - o_p\right]^2 \tag{11.22}$$

$$\frac{d(TE)}{d w_i} = -2\sum_{p=1}^{np} \left[(d_p - o_p)\,F'\{z_p\}\,f'(net_p)\,x_{pi}\right] \tag{11.23}$$


The weight update for a single pattern p is

$$\Delta w_p = \alpha\,(d_p - o_p)\,F'\{z_p\}\,f'(net_p)\,x_p \tag{11.24}$$


In the case of batch training (weights are changed once all patterns are applied),

$$\Delta w = \sum_{p=1}^{np} \Delta w_p = \alpha\sum_{p=1}^{np} \left[(d_p - o_p)\,F'\{z_p\}\,f'(net_p)\,x_p\right] \tag{11.25}$$


FIGURE 11.5 An example of the four layer (4-5-6-3) feed-forward neural network (input layer, hidden layer #1,
hidden layer #2, output layer), which is sometimes known also as multilayer perceptron (MLP) network.

FIGURE 11.6 Error back propagation for neural networks with one output.


The main difference is that instead of using just the derivative of the activation function f' as in the delta
learning rule, the product f'F' must be used. For multiple outputs, as shown in Figure 11.7, the resulting
weight change is the sum of all the weight changes from all outputs, calculated separately for each
output using Equation 11.24. In the EBP algorithm, the calculation process is organized in such a
way that error signals, $\Delta_p$, are propagated through layers from outputs to inputs, as shown in
Figure 11.8. Once the delta values on the neuron inputs are found, the weights for this neuron are updated
using a simple formula:


$$\Delta w_p = \alpha\,x_p\,\Delta_p \tag{11.26}$$


where $\Delta_p$ was calculated using the error back-propagation process for all outputs K:


$$\Delta_p = \sum_{k=1}^{K} \left[(d_{pk} - o_{pk})\,F'_k\{z_p\}\,f'(net_{pk})\right] \tag{11.27}$$


FIGURE 11.7 Error back propagation for neural networks with multiple outputs.

FIGURE 11.8 Calculation errors in neural network using error back-propagation algorithm. The symbols $f_i$ and
$g_i$ represent slopes of activation functions.


The calculation of the back-propagating error is somewhat artificial compared with a real nervous system. Also,
the error back-propagation method is not practical from the point of view of hardware realization.
Instead, it is simpler to find the signal gains $A_{jk}$ from the input of the jth neuron to each network
output k (Figure 11.9). For each pattern, the $\Delta$ value for a given neuron, j, can now be obtained for
each output k:


$$\Delta_{j,k} = A_{jk}\,(o_k - d_k), \qquad A_{jk} = \frac{\partial o_k}{\partial net_j} \tag{11.28}$$


and for all outputs for neuron j

$$\Delta_j = \sum_{k=1}^{K} \Delta_{j,k} = \sum_{k=1}^{K} \left[A_{jk}\,(o_k - d_k)\right] \tag{11.29}$$


FIGURE 11.9 Finding gradients using evaluation of signal gains, $A_{jk}$.


Note that the above formula is general, no matter whether neurons are arranged in layers or not. One
way to find the gains $A_{jk}$ is to introduce an incremental change on the input of the jth neuron and observe
the change in the kth network output. This procedure requires only forward signal propagation, and it
is easy to implement in a hardware realization. Another possible way is to calculate gains through each
layer and then find the total gains as products of layer gains. This procedure is equally or less
computationally intensive than the calculation of cumulative errors in the error back-propagation algorithm.
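
Both ways of finding the gains can be compared in a small sketch. The tiny network (one hidden neuron j feeding two tanh output neurons), its weights `V` and biases `B`, and all function names are hypothetical values chosen for this illustration.

```python
import math

V = [0.8, -1.2]   # hypothetical weights from hidden neuron j to outputs k = 0, 1
B = [0.1, 0.3]    # hypothetical biases of the output neurons


def outputs_from_net(net_j):
    """Forward pass from the net value of neuron j to all network outputs."""
    z = math.tanh(net_j)
    return [math.tanh(v * z + b) for v, b in zip(V, B)]


def gains_numeric(net_j, eps=1e-6):
    """Gains A_jk from a small change on the neuron input (forward passes only)."""
    base = outputs_from_net(net_j)
    moved = outputs_from_net(net_j + eps)
    return [(m - o) / eps for m, o in zip(moved, base)]


def gains_analytic(net_j):
    """Gains A_jk as products of the layer gains."""
    z = math.tanh(net_j)
    return [(1 - o * o) * v * (1 - z * z) for o, v in zip(outputs_from_net(net_j), V)]


def delta_j(net_j, desired):
    """Equation 11.29: sum over outputs of A_jk * (o_k - d_k)."""
    o = outputs_from_net(net_j)
    return sum(a * (ok - dk) for a, ok, dk in zip(gains_numeric(net_j), o, desired))


print(gains_numeric(0.4), gains_analytic(0.4), delta_j(0.4, [1.0, -1.0]))
```

The perturbation approach needs no knowledge of the layer structure, which is why it also applies to arbitrarily connected networks.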


11.4.2 Improvements of EBP 


11.4.2.1 Momentum 
The back-propagation algorithm has a tendency to oscillate (Figure 11.10) [PS94]. In order to smooth
out the process, the weight increment, $\Delta w_{ij}$, can be modified according to Rumelhart et al. [RHW86]:


$$w_{ij}(n + 1) = w_{ij}(n) + \Delta w_{ij}(n) + \alpha\,\Delta w_{ij}(n - 1) \tag{11.30}$$

or according to Sejnowski and Rosenberg [SR87]


$$w_{ij}(n + 1) = w_{ij}(n) + (1 - \alpha)\,\Delta w_{ij}(n) + \alpha\,\Delta w_{ij}(n - 1) \tag{11.31}$$

where $\alpha$ is the momentum term.
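
The effect of momentum can be sketched on a simple quadratic error surface. The error function, the learning constant c, and the reading of the previous increment as the previous total weight change are assumptions of this example, not statements about the referenced papers.

```python
def gradient(w):
    """Gradient of the assumed error E = 0.5 * (w1^2 + 25 * w2^2)."""
    return [w[0], 25.0 * w[1]]


def error(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def descend(alpha, c=0.02, steps=100):
    """Gradient descent in which the previous weight change is added with weight alpha."""
    w = [1.0, 1.0]
    prev = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        change = [-c * gi + alpha * pi for gi, pi in zip(g, prev)]
        w = [wi + ci for wi, ci in zip(w, change)]
        prev = change
    return error(w)


e_plain = descend(alpha=0.0)
e_momentum = descend(alpha=0.8)
print(e_plain, e_momentum)
```

With the same small learning constant, the run with momentum reaches a much lower error after the same number of steps, because consecutive changes in the shallow direction accumulate.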


11.4.2.2 Gradient Direction Search 


The back-propagation algorithm can be significantly accelerated when, after finding the components of the
gradient, weights are modified along the gradient direction until a minimum is reached. This process
can be carried out without the computationally intensive gradient calculation at each step. The
new gradient components are calculated once a minimum along the direction of the previous gradient is
obtained. This process is only possible for cumulative weight adjustment. One method to find a minimum
along the gradient direction is the three-step process of finding the error at three points along the gradient
direction and then, using a parabola approximation, jumping directly to the minimum (Figure 11.11).
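
The parabola approximation can be written as a closed-form expression for the vertex of the parabola through three points. This is a generic sketch; the function name `parabola_minimum` and the sample error values are illustrative.

```python
def parabola_minimum(xs, ys):
    """Position of the vertex of the parabola through three points (x, y)."""
    (a, b, c), (fa, fb, fc) = xs, ys
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den


# Error measured at three step lengths along the gradient direction.
steps = (0.0, 1.0, 2.0)
errs = tuple((s - 1.3) ** 2 + 2.0 for s in steps)   # hypothetical error profile
best_step = parabola_minimum(steps, errs)
print(best_step)
```

For an error profile that is exactly quadratic along the search direction, the jump lands on the minimum (here a step length of about 1.3); otherwise it gives an estimate that can be refined.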


11.4.2.3 Elimination of Flat Spots 


The back-propagation algorithm has many disadvantages that lead to very slow convergence. One of
the most painful is that the back-propagation algorithm has difficulty training neurons with the
maximally wrong answer. In order to understand this problem, let us analyze the bipolar activation function
shown in Figure 11.12. The maximum error, equal to 2, exists if the desired output is -1 and the actual output is +1.
At this condition, the derivative of the activation function is close to zero, so the neuron is not transparent
for error propagation. In other words, this neuron with a large output error will not be trained, or it
will be trained very slowly. In the meantime, other neurons will be trained, and the weights of this neuron
will remain unchanged.


FIGURE 11.10 Illustration of convergence process for (a) too small learning constant, (b) too large learning
constant, and (c) large learning constant with momentum.

FIGURE 11.11 Search on the gradient direction before a new calculation of gradient components.

FIGURE 11.12 Lack of error back propagation for very large errors.

To overcome this difficulty, a modified method for derivative calculation was introduced by
Wilamowski and Torvik [WT93]. The derivative is calculated as the slope of a line connecting the point
of the output value with the point of the desired value, as shown in Figure 11.12:


$$f_{modif} = \frac{o_{desired} - o_{actual}}{net_{desired} - net_{actual}} \tag{11.32}$$

If the computation of the activation derivative as given by

$$f'(net) = k\left[1 - o^2\right] \tag{11.33}$$

is replaced by


$$f'(net) = k\left[1 - o^2\left(1 - \left(\frac{err}{2}\right)^2\right)\right] \tag{11.34}$$


then for small errors

$$f'(net) = k\left[1 - o^2\right] \tag{11.35}$$

and for large errors (err = 2)


$$f'(net) = k \tag{11.36}$$


Note that for small errors, the modified derivative is equal to the derivative of the activation function
at the point of the output value. Therefore, the modified derivative takes effect only for large errors, which
cannot be propagated otherwise.
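
The two limiting cases can be checked numerically with a short sketch of Equations 11.33 and 11.34. The function names and the sample output value are illustrative.

```python
def standard_derivative(o, k=1.0):
    """Derivative of the bipolar activation function, Equation 11.33."""
    return k * (1.0 - o * o)


def modified_derivative(o, err, k=1.0):
    """Modified derivative of Equation 11.34, where err is the output error."""
    return k * (1.0 - o * o * (1.0 - (err / 2.0) ** 2))


o = 0.999                                   # saturated neuron output
print(standard_derivative(o))               # nearly zero: the flat spot
print(modified_derivative(o, err=0.0))      # same as the standard derivative
print(modified_derivative(o, err=1.999))    # close to k: the error can propagate
```

A saturated neuron with the maximally wrong answer therefore receives a derivative close to k instead of a value near zero.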


11.4.3 Quickprop Algorithm 


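The fast-learning algorithm using the approach below was proposed by Fahlman [F88], and it is known
as quickprop:


$$\Delta w_{ij}(t) = -\alpha S_{ij}(t) + \gamma_{ij}\,\Delta w_{ij}(t - 1) \tag{11.37}$$

$$S_{ij}(t) = \frac{\partial E(w(t))}{\partial w_{ij}} + \eta\,w_{ij}(t) \tag{11.38}$$


where
$\alpha$ is the learning constant
$\eta$ is the memory constant (small, in the 0.0001 range), which leads to reduction of weights and limits the
growth of weights
$\gamma_{ij}$ is the momentum term selected individually for each weight


$$\alpha = \begin{cases} 0.01 < \alpha < 0.6 & \text{when } \Delta w_{ij} = 0 \text{ or sign of } \Delta w_{ij} \\ 0 & \text{otherwise} \end{cases} \tag{11.39}$$

$$S_{ij}(t)\,\Delta w_{ij}(t) > 0 \tag{11.40}$$

$$\Delta w_{ij}(t) = -\alpha S_{ij}(t) + \gamma_{ij}\,\Delta w_{ij}(t - 1) \tag{11.41}$$


The momentum term selected individually for each weight is a very important part of this algorithm.
The quickprop algorithm sometimes reduces the computation time by a factor of hundreds.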
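11.4.4 RPROP-Resilient Error Back Propagation


RPROP is very similar to EBP, but the weights are adjusted using only the sign of the propagated errors, not
their values. Learning constants are selected individually for each weight based on the history:


$$\Delta w_{ij}(t) = -\alpha_{ij}\,\mathrm{sgn}\left(\frac{\partial E(w(t))}{\partial w_{ij}(t)}\right) \tag{11.42}$$

$$S_{ij}(t) = \frac{\partial E(w(t))}{\partial w_{ij}} + \eta\,w_{ij}(t) \tag{11.43}$$

$$\alpha_{ij}(t) = \begin{cases} \min\left(a \cdot \alpha_{ij}(t - 1),\ \alpha_{max}\right) & \text{for } S_{ij}(t) \cdot S_{ij}(t - 1) > 0 \\ \max\left(b \cdot \alpha_{ij}(t - 1),\ \alpha_{min}\right) & \text{for } S_{ij}(t) \cdot S_{ij}(t - 1) < 0 \\ \alpha_{ij}(t - 1) & \text{otherwise} \end{cases}$$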
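
The sign-based update with individually adapted learning constants can be sketched as follows. The quadratic error function, the constants a = 1.2 and b = 0.5, the limits on the learning constants, and the use of the plain gradient as S (that is, a zero memory constant) are all assumptions of this example.

```python
def sign(v):
    return (v > 0) - (v < 0)


def rprop_minimize(gradient, w, a=1.2, b=0.5, rate0=0.1,
                   rate_min=1e-6, rate_max=1.0, steps=200):
    """Adjust each weight using only the sign of its gradient component."""
    rates = [rate0] * len(w)
    prev = [0.0] * len(w)
    for _ in range(steps):
        s = gradient(w)
        for i in range(len(w)):
            if s[i] * prev[i] > 0:       # same sign as before: increase the constant
                rates[i] = min(a * rates[i], rate_max)
            elif s[i] * prev[i] < 0:     # sign changed: decrease the constant
                rates[i] = max(b * rates[i], rate_min)
            w[i] -= rates[i] * sign(s[i])
        prev = s
    return w


# Assumed error E = 0.5 * (w1^2 + 25 * w2^2), with gradient (w1, 25 * w2).
w = rprop_minimize(lambda w: [w[0], 25.0 * w[1]], [1.0, 1.0])
print(w)
```

Because only signs are used, the steep and the shallow directions of the error surface are handled in the same way, and the steps shrink automatically once a weight starts to oscillate around its minimum.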


11.4.5 Back Percolation 


The error is propagated as in EBP, and then each neuron is "trained" using an algorithm for training one neuron,
such as pseudo inversion. Unfortunately, pseudo inversion may lead to errors, which are sometimes
larger than 2 for bipolar neurons or larger than 1 for unipolar neurons.


11.4.6 Delta-Bar-Delta 


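For each weight, the learning coefficient is selected individually. It was developed for quadratic error
functions


$$\Delta\alpha_{ij}(t) = \begin{cases} a & \text{for } S_{ij}(t - 1)\,D_{ij}(t) > 0 \\ -b \cdot \alpha_{ij}(t - 1) & \text{for } S_{ij}(t - 1)\,D_{ij}(t) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{11.44}$$

$$D_{ij}(t) = \frac{\partial E(t)}{\partial w_{ij}(t)} \tag{11.45}$$

$$S_{ij}(t) = (1 - \xi)\,D_{ij}(t) + \xi\,S_{ij}(t - 1) \tag{11.46}$$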


11.5 Advanced Learning Algorithms 


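Let us introduce basic definitions for the terms used in advanced second-order algorithms. In
first-order methods such as EBP, which is the steepest descent method, the weight changes are
proportional to the gradient of the error function:


$$w_{k+1} = w_k - \alpha g_k \tag{11.47}$$

where g is the gradient vector

$$g = \begin{bmatrix} \dfrac{\partial E}{\partial w_1} & \dfrac{\partial E}{\partial w_2} & \cdots & \dfrac{\partial E}{\partial w_n} \end{bmatrix}^T \tag{11.48}$$

For the second-order Newton method, Equation 11.47 is replaced by

$$w_{k+1} = w_k - A_k^{-1} g_k \tag{11.49}$$

where $A_k$ is the Hessian

$$A = \begin{bmatrix}
\dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_2 \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_n \partial w_1} \\
\dfrac{\partial^2 E}{\partial w_1 \partial w_2} & \dfrac{\partial^2 E}{\partial w_2^2} & \cdots & \dfrac{\partial^2 E}{\partial w_n \partial w_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 E}{\partial w_1 \partial w_n} & \dfrac{\partial^2 E}{\partial w_2 \partial w_n} & \cdots & \dfrac{\partial^2 E}{\partial w_n^2}
\end{bmatrix} \tag{11.50}$$

Unfortunately, it is very difficult to find Hessians, so in the Gauss-Newton method the Hessian is
replaced by a product of Jacobians:

$$A \approx 2J^T J \tag{11.51}$$

where

$$J = \begin{bmatrix}
\dfrac{\partial e_{11}}{\partial w_1} & \dfrac{\partial e_{11}}{\partial w_2} & \cdots & \dfrac{\partial e_{11}}{\partial w_n} \\
\dfrac{\partial e_{21}}{\partial w_1} & \dfrac{\partial e_{21}}{\partial w_2} & \cdots & \dfrac{\partial e_{21}}{\partial w_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{M1}}{\partial w_1} & \dfrac{\partial e_{M1}}{\partial w_2} & \cdots & \dfrac{\partial e_{M1}}{\partial w_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{1P}}{\partial w_1} & \dfrac{\partial e_{1P}}{\partial w_2} & \cdots & \dfrac{\partial e_{1P}}{\partial w_n} \\
\dfrac{\partial e_{2P}}{\partial w_1} & \dfrac{\partial e_{2P}}{\partial w_2} & \cdots & \dfrac{\partial e_{2P}}{\partial w_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{MP}}{\partial w_1} & \dfrac{\partial e_{MP}}{\partial w_2} & \cdots & \dfrac{\partial e_{MP}}{\partial w_n}
\end{bmatrix} \tag{11.52}$$

Knowing the Jacobian, J, the gradient can be calculated as


$$g = 2J^T e \tag{11.53}$$

Therefore, in the Gauss-Newton algorithm, the weight update is calculated as

$$w_{k+1} = w_k - \left(J_k^T J_k\right)^{-1} J_k^T e \tag{11.54}$$


This method is very fast, but it works well only for systems that are almost linear, and this is not true for
neural networks.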


11.5.1 Levenberg-Marquardt Algorithm


Levenberg and Marquardt, in order to secure convergence, modified Equation 11.54 to the form

$$w_{k+1} = w_k - \left(J_k^T J_k + \mu I\right)^{-1} J_k^T e \tag{11.55}$$


where the $\mu$ parameter is changed during the training process. If $\mu$ = 0, the algorithm works as the Gauss-Newton
method, and for large values of $\mu$ the algorithm works as the steepest descent method.

The Levenberg-Marquardt algorithm was adopted for neural network training by Hagan and Menhaj
[HM94], and then Demuth and Beale [DB04] adopted the LM algorithm in the MATLAB® Neural Network
Toolbox.

The LM algorithm is very fast, but there are several problems:


1. It was written only for MLP networks, which are not the best architectures for neural networks.
2. It can handle only problems with a relatively small number of patterns, because the size of the Jacobian
is proportional to the number of patterns.
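
The update of Equation 11.55 can be sketched for a single tanh neuron with two weights, which keeps the matrix inversion to a 2 × 2 case. The rule for changing the parameter (divide by 10 after a successful step, multiply by 10 otherwise), the data, and the name `lm_fit` are assumptions of this example and not a description of any toolbox implementation.

```python
import math


def lm_fit(xs, ds, w, mu=0.01, iterations=50):
    """Fit o = tanh(w[0]*x + w[1]) using the update of Equation 11.55."""
    def errors(w):
        return [d - math.tanh(w[0] * x + w[1]) for x, d in zip(xs, ds)]

    def sse(e):
        return sum(v * v for v in e)

    e = errors(w)
    for _ in range(iterations):
        # Jacobian rows: de_p/dw = -(1 - o_p^2) * [x_p, 1], with o_p = d_p - e_p.
        rows = []
        for x, d, ep in zip(xs, ds, e):
            slope = -(1.0 - (d - ep) ** 2)
            rows.append((slope * x, slope))
        a11 = sum(r[0] * r[0] for r in rows)
        a12 = sum(r[0] * r[1] for r in rows)
        a22 = sum(r[1] * r[1] for r in rows)
        b1 = sum(r[0] * ep for r, ep in zip(rows, e))
        b2 = sum(r[1] * ep for r, ep in zip(rows, e))
        while mu < 1e10:
            h11, h22 = a11 + mu, a22 + mu            # J^T J + mu * I
            det = h11 * h22 - a12 * a12
            step = [(h22 * b1 - a12 * b2) / det, (h11 * b2 - a12 * b1) / det]
            trial = [w[0] - step[0], w[1] - step[1]]
            e_trial = errors(trial)
            if sse(e_trial) < sse(e):                # accept: move toward Gauss-Newton
                w, e, mu = trial, e_trial, mu / 10.0
                break
            mu *= 10.0                               # reject: move toward steepest descent
        else:
            break                                    # no improving step found
    return w, sse(e)


xs = [-2.0 + 0.5 * i for i in range(9)]
ds = [math.tanh(1.5 * x - 0.5) for x in xs]          # hypothetical noise-free targets
w_fit, final_sse = lm_fit(xs, ds, [0.0, 0.0])
print(w_fit, final_sse)
```

Note that the Jacobian here has one row per pattern, which illustrates why its size grows with the number of patterns.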


11.5.2 Neuron by Neuron 


The recently developed neuron by neuron (NBN) algorithm [WCKD08,CWD08,W09,YW09] is very fast.
Figures 11.13 and 11.14 show a speed comparison of the EBP and NBN algorithms on the parity-4 problem.
The NBN algorithm eliminates most deficiencies of the LM algorithm. It can be used to train neural
networks with arbitrarily connected neurons (not just the MLP architecture). It does not require computing
and storing large Jacobians, so it can train problems with a basically unlimited number of patterns
[WH10,WHY10]. Error derivatives are computed only in the forward pass, so the back-propagation process
is not needed. It is as fast as the LM algorithm, and faster in the case of networks with multiple outputs.
It can train networks that are impossible to train with other algorithms. A more detailed
description of the NBN algorithm is given in Chapter 13.


FIGURE 11.13 Sum of squared errors (vertical axis, 1.0E-04 to 1.0E+01) as a function of number of iterations
(horizontal axis, iteration × 2000) for the parity-4 problem using EBP algorithm, and 100 runs.

FIGURE 11.14 Sum of squared errors as a function of number of iterations (horizontal axis, iteration × 2) for
the parity-4 problem using NBN algorithm and 100 runs.


11.6 Warnings about Neural Network Training 


It is much easier to train neural networks where the number of neurons is larger than required. But with
a smaller number of neurons, the neural network has much better generalization abilities; that is, it will
respond correctly to patterns not used for training. If too many neurons are used, then the network can be
overtrained on the training patterns, but it will fail on patterns never used in training. With a smaller number
of neurons, the network cannot be trained to very small errors, but it may produce much better approximations
for new patterns. The most common mistake made by many researchers is that, in order to speed up the training
process and to reduce the training errors, they use neural networks with a larger number of neurons than
required. Such networks perform very poorly for new patterns not used for training [W09,ISIE,PE10].


11.7 Conclusion 


There are several reasons for the frustration of people trying to adapt neural networks for their research:


1. In most cases, the relatively inefficient MLP architecture is used instead of more powerful topologies
[WHM03] where connections across layers are allowed.

2. When a popular learning algorithm such as EBP is used, the training process is not only very time
consuming, but frequently the wrong solution is obtained. In other words, EBP is often not able to
find solutions for neural networks with the smallest possible number of neurons.

3. It is easy to train neural networks with an excessive number of neurons. Such complex
architectures can be trained to very small errors for a given set of patterns, but such networks do not have
generalization abilities. Such networks are not able to deliver a correct response to new patterns,
which were not used for training [W09,HW09]. In other words, the main purpose of using neural
networks is missed. In order to properly utilize neural networks, their architecture should be as
simple as possible to perform the required function.

4. In order to find solutions for close-to-optimal architectures, second-order algorithms such as
NBN or LM should be used [WCKD07,WCKD08]. Unfortunately, the LM algorithm adopted in
the popular MATLAB NN Toolbox can handle only the MLP topology without connections across
layers, and these topologies are far from optimal.


The importance of the proper learning algorithm was emphasized, since with an advanced learning
algorithm we can train those networks that cannot be trained with simple algorithms. The software
used in this work, which implements the NBN algorithm, can be downloaded from [WY09].


12.1 Introduction


The Levenberg-Marquardt algorithm [L44,M63], which was independently developed by Kenneth 
Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a non- 
linear function. It is fast and has stable convergence. In the artificial neural-networks field, this algo- 
rithm is suitable for training small- and medium-sized problems. 

Many other methods have already been developed for neural-network training. The steepest descent
algorithm, also known as the error backpropagation (EBP) algorithm [EHW86,J88], dispersed the dark clouds
on the field of artificial neural networks and could be regarded as one of the most significant breakthroughs
for training neural networks. Many improvements have been made to EBP [WT93,AW95,W96,WCM99],
but these improvements are relatively minor [W02,WHM03,YW09,W09,WH10]. Sometimes, instead of
improving learning algorithms, special neural network architectures that are easy to train are used
[WB99,WB01,WJ96,PE10,BUD90]. The EBP algorithm is still widely used today; however, it is also known
as an inefficient algorithm because of its slow convergence. There are two main reasons for the slow
convergence [ISIE10,WHY10]. The first reason is that its step sizes should be adequate to the gradients
(Figure 12.1). Logically, small step sizes should be taken where the gradient is steep so as not to rattle
out of the required minima (because of oscillation). So, if the step size is a constant, it needs to be chosen
small. Then, in places where the gradient is gentle, the training process would be very slow. The second
reason is that the curvature of the error surface may not be the same in all directions, as with the
Rosenbrock function, so the classic “error valley” problem [O92] may exist and may result in slow convergence.
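
The step-size dilemma can be reproduced numerically. The following sketch is only an illustration and is not taken from the chapter: it runs constant-step steepest descent on an assumed two-weight quadratic error E = 0.5(w1^2 + 25 w2^2), whose curvature differs by a factor of 25 between the two directions. The function names and the two step sizes are hypothetical choices.

```python
# Constant-step steepest descent on an assumed "error valley":
# E(w) = 0.5 * (w1^2 + 25 * w2^2); curvature is 1 along w1 and 25 along w2.

def error(w):
    """Assumed error function (not a neural network error surface)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def gradient(w):
    """Gradient of the assumed error function."""
    return [w[0], 25.0 * w[1]]

def steepest_descent(w0, alpha, iterations):
    """Apply w <- w - alpha * g a fixed number of times."""
    w = list(w0)
    for _ in range(iterations):
        g = gradient(w)
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
    return w

start = [1.0, 1.0]
small = steepest_descent(start, alpha=0.01, iterations=100)
large = steepest_descent(start, alpha=0.09, iterations=100)

print("small step:", error(small))  # decreases, but slowly along the gentle w1 direction
print("large step:", error(large))  # grows: the steep w2 direction oscillates and diverges
```

For this particular function the steep direction is stable only for step sizes below 2/25 = 0.08, so any constant step that is safe there is necessarily slow along the gentle direction.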

The slow convergence of the steepest descent method can be greatly improved by the Gauss-Newton
algorithm [O92]. Using second-order derivatives of the error function to “naturally” evaluate the curvature
of the error surface, the Gauss-Newton algorithm can find proper step sizes for each direction and converge
very fast; in particular, if the error function has a quadratic surface, it can converge directly in the first
iteration. But this improvement only happens when the quadratic approximation of the error function is
reasonable. Otherwise, the Gauss-Newton algorithm would be mostly divergent.


FIGURE 12.1 Searching process of the steepest descent method with different learning constants: the trajectory
on the left (EBP algorithm with small constant step size) is for a small learning constant that leads to slow
convergence; the trajectory on the right (EBP algorithm with large constant step size) is for a large learning
constant that causes oscillation (divergence).


The Levenberg-Marquardt algorithm blends the steepest descent method and the Gauss-Newton
algorithm. Fortunately, it inherits the speed advantage of the Gauss-Newton algorithm and the stability
of the steepest descent method. It is more robust than the Gauss-Newton algorithm, because in many
cases it can converge well even if the error surface is much more complex than the quadratic situation.
Although the Levenberg-Marquardt algorithm tends to be a bit slower than the Gauss-Newton algorithm
(in convergent situations), it converges much faster than the steepest descent method.

The basic idea of the Levenberg-Marquardt algorithm is that it performs a combined training process:
around an area with complex curvature, the Levenberg-Marquardt algorithm switches to the steepest
descent algorithm until the local curvature is suitable for a quadratic approximation; then it approximately
becomes the Gauss-Newton algorithm, which can speed up the convergence significantly.


12.2 Algorithm Derivation 


In this section, the derivation of the Levenberg-Marquardt algorithm is presented in four parts:
(1) steepest descent algorithm, (2) Newton’s method, (3) Gauss-Newton algorithm, and
(4) Levenberg-Marquardt algorithm.

Before the derivation, let us introduce some commonly used indices: 


• p is the index of patterns, from 1 to P, where P is the number of patterns.
• m is the index of outputs, from 1 to M, where M is the number of outputs.
• i and j are the indices of weights, from 1 to N, where N is the number of weights.
• k is the index of iterations.


Other indices will be explained in related places. 
The sum square error (SSE) is defined to evaluate the training process. For all training patterns and
network outputs, it is calculated by

$$E(\mathbf{x},\mathbf{w}) = \frac{1}{2}\sum_{p=1}^{P}\sum_{m=1}^{M} e_{p,m}^{2} \tag{12.1}$$

where
x is the input vector
w is the weight vector
$e_{p,m}$ is the training error at output m when applying pattern p, and it is defined as

$$e_{p,m} = d_{p,m} - o_{p,m} \tag{12.2}$$

where
d is the desired output vector
o is the actual output vector
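
As a quick numerical check of Equations 12.1 and 12.2, the sketch below evaluates the SSE for two patterns and two outputs. The data values and the function name `sum_square_error` are made up for illustration.

```python
def sum_square_error(desired, actual):
    """SSE of Equation 12.1 with e[p][m] = d[p][m] - o[p][m] (Equation 12.2).

    desired, actual: lists of P patterns, each a list of M output values.
    """
    total = 0.0
    for d_p, o_p in zip(desired, actual):
        for d, o in zip(d_p, o_p):
            e = d - o
            total += e * e
    return 0.5 * total

# Hypothetical data: P = 2 patterns, M = 2 outputs.
desired = [[1.0, -1.0], [-1.0, 1.0]]
actual = [[0.5, -1.0], [-1.0, 0.0]]
print(sum_square_error(desired, actual))  # 0.5 * (0.25 + 0 + 0 + 1) = 0.625
```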


12.2.1 Steepest Descent Algorithm


The steepest descent algorithm is a first-order algorithm. It uses the first-order derivative of the total
error function to find the minima in error space. Normally, the gradient g is defined as the first-order
derivative of the total error function (12.1):

$$\mathbf{g} = \frac{\partial E(\mathbf{x},\mathbf{w})}{\partial \mathbf{w}} = \left[\frac{\partial E}{\partial w_1} \;\; \frac{\partial E}{\partial w_2} \;\; \cdots \;\; \frac{\partial E}{\partial w_N}\right]^{T} \tag{12.3}$$

With the definition of the gradient g in (12.3), the update rule of the steepest descent algorithm can be
written as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha\,\mathbf{g}_k \tag{12.4}$$

where α is the learning constant (step size).

The training process of the steepest descent algorithm converges asymptotically. Around the solution,
all the elements of the gradient vector would be very small and there would be only a very tiny weight
change.

12.2.2 Newton’s Method 


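Newton’s method assumes that all the gradient components $g_1, g_2, \ldots, g_N$ are functions of the weights:

$$\begin{cases}
g_1 = F_1(w_1, w_2, \ldots, w_N) \\
g_2 = F_2(w_1, w_2, \ldots, w_N) \\
\quad\vdots \\
g_N = F_N(w_1, w_2, \ldots, w_N)
\end{cases} \tag{12.5}$$

where $F_1, F_2, \ldots, F_N$ are nonlinear relationships between weights and related gradient components.

Unfold each $g_i$ (i = 1, 2, ..., N) in Equations 12.5 by Taylor series and take the first-order approximation:

$$\begin{cases}
g_1 \approx g_{1,0} + \dfrac{\partial g_1}{\partial w_1}\Delta w_1 + \dfrac{\partial g_1}{\partial w_2}\Delta w_2 + \cdots + \dfrac{\partial g_1}{\partial w_N}\Delta w_N \\
g_2 \approx g_{2,0} + \dfrac{\partial g_2}{\partial w_1}\Delta w_1 + \dfrac{\partial g_2}{\partial w_2}\Delta w_2 + \cdots + \dfrac{\partial g_2}{\partial w_N}\Delta w_N \\
\quad\vdots \\
g_N \approx g_{N,0} + \dfrac{\partial g_N}{\partial w_1}\Delta w_1 + \dfrac{\partial g_N}{\partial w_2}\Delta w_2 + \cdots + \dfrac{\partial g_N}{\partial w_N}\Delta w_N
\end{cases} \tag{12.6}$$

By combining with the definition of the gradient vector g in (12.3), it can be determined that

$$\frac{\partial g_i}{\partial w_j} = \frac{\partial\left(\dfrac{\partial E}{\partial w_i}\right)}{\partial w_j} = \frac{\partial^2 E}{\partial w_i\,\partial w_j} \tag{12.7}$$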
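
By inserting Equation 12.7 into 12.6:

$$\begin{cases}
g_1 \approx g_{1,0} + \dfrac{\partial^2 E}{\partial w_1^2}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_1 \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_1 \partial w_N}\Delta w_N \\
g_2 \approx g_{2,0} + \dfrac{\partial^2 E}{\partial w_2 \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_2^2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_2 \partial w_N}\Delta w_N \\
\quad\vdots \\
g_N \approx g_{N,0} + \dfrac{\partial^2 E}{\partial w_N \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_N \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_N^2}\Delta w_N
\end{cases} \tag{12.8}$$

Compared with the steepest descent method, the second-order derivatives of the total error function
need to be calculated for each component of the gradient vector.
In order to get the minima of the total error function E, each element of the gradient vector should be
zero. Therefore, the left sides of Equations 12.8 are all zero, and then

$$\begin{cases}
0 \approx g_{1,0} + \dfrac{\partial^2 E}{\partial w_1^2}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_1 \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_1 \partial w_N}\Delta w_N \\
0 \approx g_{2,0} + \dfrac{\partial^2 E}{\partial w_2 \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_2^2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_2 \partial w_N}\Delta w_N \\
\quad\vdots \\
0 \approx g_{N,0} + \dfrac{\partial^2 E}{\partial w_N \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_N \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_N^2}\Delta w_N
\end{cases} \tag{12.9}$$

By combining Equation 12.3 with 12.9

$$\begin{cases}
-\dfrac{\partial E}{\partial w_1} = -g_{1,0} \approx \dfrac{\partial^2 E}{\partial w_1^2}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_1 \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_1 \partial w_N}\Delta w_N \\
-\dfrac{\partial E}{\partial w_2} = -g_{2,0} \approx \dfrac{\partial^2 E}{\partial w_2 \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_2^2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_2 \partial w_N}\Delta w_N \\
\quad\vdots \\
-\dfrac{\partial E}{\partial w_N} = -g_{N,0} \approx \dfrac{\partial^2 E}{\partial w_N \partial w_1}\Delta w_1 + \dfrac{\partial^2 E}{\partial w_N \partial w_2}\Delta w_2 + \cdots + \dfrac{\partial^2 E}{\partial w_N^2}\Delta w_N
\end{cases} \tag{12.10}$$

There are N equations with N unknowns, so all $\Delta w_i$ (i = 1, 2, ..., N) can be calculated. With the
solutions, the weight space can be updated iteratively.
Equations 12.10 can also be written in matrix form:

$$\begin{bmatrix} -g_1 \\ -g_2 \\ \vdots \\ -g_N \end{bmatrix}
= \begin{bmatrix} -\dfrac{\partial E}{\partial w_1} \\ -\dfrac{\partial E}{\partial w_2} \\ \vdots \\ -\dfrac{\partial E}{\partial w_N} \end{bmatrix}
= \begin{bmatrix}
\dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 E}{\partial w_1 \partial w_N} \\
\dfrac{\partial^2 E}{\partial w_2 \partial w_1} & \dfrac{\partial^2 E}{\partial w_2^2} & \cdots & \dfrac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 E}{\partial w_N \partial w_1} & \dfrac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \dfrac{\partial^2 E}{\partial w_N^2}
\end{bmatrix}
\times \begin{bmatrix} \Delta w_1 \\ \Delta w_2 \\ \vdots \\ \Delta w_N \end{bmatrix} \tag{12.11}$$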
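
where the square matrix is the Hessian matrix:

$$\mathbf{H} = \begin{bmatrix}
\dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 E}{\partial w_1 \partial w_N} \\
\dfrac{\partial^2 E}{\partial w_2 \partial w_1} & \dfrac{\partial^2 E}{\partial w_2^2} & \cdots & \dfrac{\partial^2 E}{\partial w_2 \partial w_N} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 E}{\partial w_N \partial w_1} & \dfrac{\partial^2 E}{\partial w_N \partial w_2} & \cdots & \dfrac{\partial^2 E}{\partial w_N^2}
\end{bmatrix} \tag{12.12}$$

Equation 12.11 can be written in matrix form as

$$-\mathbf{g} = \mathbf{H}\,\Delta\mathbf{w} \tag{12.13}$$

So

$$\Delta\mathbf{w} = -\mathbf{H}^{-1}\mathbf{g} \tag{12.14}$$

Therefore, the update rule for Newton’s method is

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}_k^{-1}\mathbf{g}_k \tag{12.15}$$

As the second-order derivatives of the total error function, the Hessian matrix H gives the proper evaluation
of the change of the gradient vector. By comparing Equations 12.4 and 12.15, one may notice that
well-matched step sizes are given by the inverted Hessian matrix.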
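
To make the role of the inverted Hessian concrete, the sketch below applies the update rule of Equation 12.15 once to an assumed quadratic error E = 0.5(w1^2 + 25 w2^2). The error function, the helper `solve_2x2`, and the starting point are illustrative assumptions rather than part of the chapter; the point is only that, on a quadratic surface, a single Newton step lands on the minimum regardless of the difference in curvature.

```python
# Newton's method (Equation 12.15) on an assumed quadratic error
# E(w) = 0.5 * (w1^2 + 25 * w2^2), for which the Hessian is constant.

def gradient(w):
    return [w[0], 25.0 * w[1]]

def hessian(w):
    # Second-order derivatives of the assumed E; the off-diagonal terms are zero.
    return [[1.0, 0.0], [0.0, 25.0]]

def solve_2x2(a, b):
    """Solve a * x = b for a 2 x 2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]

def newton_step(w):
    """Return w - H^{-1} g without forming the inverse explicitly."""
    delta = solve_2x2(hessian(w), gradient(w))
    return [w[0] - delta[0], w[1] - delta[1]]

w_next = newton_step([1.0, 1.0])
print(w_next)  # the minimum at the origin is reached in one step
```

Solving the linear system instead of inverting H is a common design choice; for more than a handful of weights a general linear solver would replace the hypothetical 2 x 2 helper.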


12.2.3 Gauss—Newton Algorithm 


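If Newton’s method is applied for weight updating, the second-order derivatives of the total error
function have to be calculated in order to get the Hessian matrix H, and this could be very complicated.
In order to simplify the calculating process, the Jacobian matrix J is introduced as

$$\mathbf{J} = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w_1} & \dfrac{\partial e_{1,1}}{\partial w_2} & \cdots & \dfrac{\partial e_{1,1}}{\partial w_N} \\
\dfrac{\partial e_{1,2}}{\partial w_1} & \dfrac{\partial e_{1,2}}{\partial w_2} & \cdots & \dfrac{\partial e_{1,2}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{1,M}}{\partial w_1} & \dfrac{\partial e_{1,M}}{\partial w_2} & \cdots & \dfrac{\partial e_{1,M}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{P,1}}{\partial w_1} & \dfrac{\partial e_{P,1}}{\partial w_2} & \cdots & \dfrac{\partial e_{P,1}}{\partial w_N} \\
\dfrac{\partial e_{P,2}}{\partial w_1} & \dfrac{\partial e_{P,2}}{\partial w_2} & \cdots & \dfrac{\partial e_{P,2}}{\partial w_N} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_{P,M}}{\partial w_1} & \dfrac{\partial e_{P,M}}{\partial w_2} & \cdots & \dfrac{\partial e_{P,M}}{\partial w_N}
\end{bmatrix} \tag{12.16}$$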
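
By combining Equations 12.1 and 12.3, the elements of the gradient vector can be calculated as

$$g_i = \frac{\partial E}{\partial w_i} = \frac{\partial\left(\dfrac{1}{2}\displaystyle\sum_{p=1}^{P}\sum_{m=1}^{M} e_{p,m}^{2}\right)}{\partial w_i} = \sum_{p=1}^{P}\sum_{m=1}^{M}\left(\frac{\partial e_{p,m}}{\partial w_i}\, e_{p,m}\right) \tag{12.17}$$

Combining Equations 12.16 and 12.17, the relationship between the Jacobian matrix J and the gradient
vector g would be

$$\mathbf{g} = \mathbf{J}^{T}\mathbf{e} \tag{12.18}$$

where the error vector e has the form

$$\mathbf{e} = \begin{bmatrix} e_{1,1} & e_{1,2} & \cdots & e_{1,M} & \cdots & e_{P,1} & e_{P,2} & \cdots & e_{P,M} \end{bmatrix}^{T} \tag{12.19}$$

Inserting Equation 12.1 into 12.12, the element at the ith row and jth column of the Hessian matrix can be
calculated as

$$h_{i,j} = \frac{\partial^2 E}{\partial w_i\,\partial w_j} = \frac{\partial^2\left(\dfrac{1}{2}\displaystyle\sum_{p=1}^{P}\sum_{m=1}^{M} e_{p,m}^{2}\right)}{\partial w_i\,\partial w_j} = \sum_{p=1}^{P}\sum_{m=1}^{M}\frac{\partial e_{p,m}}{\partial w_i}\frac{\partial e_{p,m}}{\partial w_j} + S_{i,j} \tag{12.20}$$

where $S_{i,j}$ is equal to

$$S_{i,j} = \sum_{p=1}^{P}\sum_{m=1}^{M}\frac{\partial^2 e_{p,m}}{\partial w_i\,\partial w_j}\, e_{p,m} \tag{12.21}$$

As the basic assumption of the Gauss-Newton method is that $S_{i,j}$ is close to zero [TM94], the relationship
between the Hessian matrix H and the Jacobian matrix J can be rewritten as

$$\mathbf{H} \approx \mathbf{J}^{T}\mathbf{J} \tag{12.22}$$

By combining Equations 12.15, 12.18, and 12.22, the update rule of the Gauss-Newton algorithm is
presented as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}_k^{T}\mathbf{J}_k\right)^{-1}\mathbf{J}_k^{T}\mathbf{e}_k \tag{12.23}$$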

Obviously, the advantage of the Gauss-Newton algorithm over the standard Newton’s method
(Equation 12.15) is that the former does not require the calculation of second-order derivatives of the
total error function, because it introduces the Jacobian matrix J instead. However, the Gauss-Newton
algorithm still faces the same convergence problem as Newton’s method for complex error space
optimization. Mathematically, the problem can be interpreted as follows: the matrix $\mathbf{J}^{T}\mathbf{J}$ may not be
invertible.
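
The quantities used by the Gauss-Newton algorithm, and the way the inversion can fail, can be seen on a toy example. The sketch below is an illustration under assumed data: it forms the gradient and the approximated Hessian from a small hypothetical Jacobian (rows are pattern/output pairs, columns are weights), and then shows that a Jacobian with linearly dependent columns gives a singular product. All function names and numbers are made up.

```python
# Gradient and approximated Hessian built from a Jacobian.
# Rows of J correspond to (pattern, output) pairs, columns to weights.

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gradient_from_jacobian(jac, err):
    """g = J^T e, with the error vector given as a flat list."""
    return [sum(row[i] * e for row, e in zip(jac, err)) for i in range(len(jac[0]))]

def det_2x2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

# Hypothetical Jacobian: 3 rows (pattern/output pairs), 2 weights.
jac = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
err = [0.5, -1.0, 0.25]
g = gradient_from_jacobian(jac, err)
h = matmul(transpose(jac), jac)

# A rank-deficient Jacobian: the second column is twice the first.
jac_bad = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
h_bad = matmul(transpose(jac_bad), jac_bad)

print(g, h)
print(det_2x2(h_bad))  # zero determinant: this J^T J cannot be inverted
```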


12.2.4 Levenberg—Marquardt Algorithm 


In order to make sure that the approximated Hessian matrix $\mathbf{J}^{T}\mathbf{J}$ is invertible, the Levenberg-Marquardt
algorithm introduces another approximation to the Hessian matrix:

$$\mathbf{H} \approx \mathbf{J}^{T}\mathbf{J} + \mu\mathbf{I} \tag{12.24}$$

where
μ is always positive and is called the combination coefficient
I is the identity matrix

From Equation 12.24, one may notice that the elements on the main diagonal of the approximated
Hessian matrix will be larger than zero. More precisely, $\mathbf{J}^{T}\mathbf{J}$ is positive semidefinite, so adding μI
with μ > 0 makes the sum positive definite. Therefore, with this approximation (Equation 12.24), the
matrix H is guaranteed to be invertible.

By combining Equations 12.23 and 12.24, the update rule of the Levenberg-Marquardt algorithm can
be presented as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \left(\mathbf{J}_k^{T}\mathbf{J}_k + \mu\mathbf{I}\right)^{-1}\mathbf{J}_k^{T}\mathbf{e}_k \tag{12.25}$$

As the combination of the steepest descent algorithm and the Gauss-Newton algorithm, the
Levenberg-Marquardt algorithm switches between the two algorithms during the training process. When the
combination coefficient μ is very small (nearly zero), Equation 12.25 approaches Equation 12.23 and
the Gauss-Newton algorithm is used. When the combination coefficient μ is very large, Equation 12.25
approximates Equation 12.4 and the steepest descent method is used.

If the combination coefficient μ in Equation 12.25 is very large, its reciprocal can be interpreted as the
learning coefficient in the steepest descent method (12.4):

$$\alpha = \frac{1}{\mu} \tag{12.26}$$
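
Table 12.1 summarizes the update rules for various algorithms.

TABLE 12.1 Specifications of Different Algorithms

Algorithms | Update Rules | Convergence | Computation Complexity
EBP algorithm | $\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha\,\mathbf{g}_k$ | Stable, slow | Gradient
Newton algorithm | $\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{H}_k^{-1}\mathbf{g}_k$ | Unstable, fast | Gradient and Hessian
Gauss-Newton algorithm | $\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^{T}\mathbf{J}_k)^{-1}\mathbf{J}_k^{T}\mathbf{e}_k$ | Unstable, fast | Jacobian
Levenberg-Marquardt algorithm | $\mathbf{w}_{k+1} = \mathbf{w}_k - (\mathbf{J}_k^{T}\mathbf{J}_k + \mu\mathbf{I})^{-1}\mathbf{J}_k^{T}\mathbf{e}_k$ | Stable, fast | Jacobian
NBN algorithm [08WC]ᵃ | $\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{Q}_k^{-1}\mathbf{g}_k$ | Stable, fast | Quasi Hessianᵃ

ᵃ Reference Chapter 12.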
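
The two limiting cases of the Levenberg-Marquardt update can be checked numerically. The sketch below is an illustration for a problem with only two weights, using a made-up Jacobian and error vector and a hypothetical helper `lm_step`; it compares the weight change for a very small and a very large combination coefficient.

```python
# Levenberg-Marquardt weight change for a problem with two weights.

def lm_step(jac, err, mu):
    """Return -(J^T J + mu I)^{-1} J^T e for a Jacobian with two columns."""
    g0 = sum(row[0] * e for row, e in zip(jac, err))
    g1 = sum(row[1] * e for row, e in zip(jac, err))
    a00 = sum(row[0] * row[0] for row in jac) + mu
    a01 = sum(row[0] * row[1] for row in jac)
    a11 = sum(row[1] * row[1] for row in jac) + mu
    det = a00 * a11 - a01 * a01
    # Cramer's rule for the symmetric 2 x 2 system, followed by the sign change.
    return [-(g0 * a11 - a01 * g1) / det, -(a00 * g1 - a01 * g0) / det]

jac = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical Jacobian
err = [0.5, -1.0, 0.25]                       # hypothetical error vector
gradient = [0.75, 0.0]                        # J^T e for the data above

gauss_newton_like = lm_step(jac, err, mu=1e-9)
descent_like = lm_step(jac, err, mu=1e6)

print(gauss_newton_like)                 # close to the Gauss-Newton step
print(descent_like)                      # close to -(1 / mu) * g
print([-g / 1e6 for g in gradient])      # steepest descent step with alpha = 1 / mu
```

With the small coefficient the result is essentially the Gauss-Newton step, and with the large one it is essentially a steepest descent step whose learning constant is the reciprocal of the coefficient.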


12.3 Algorithm Implementation


In order to implement the Levenberg-Marquardt algorithm for neural network training, two problems 
have to be solved: how does one calculate the Jacobian matrix, and how does one organize the training 
process iteratively for weight updating. 

In this section, the implementation of training with the Levenberg—Marquardt algorithm will be 
introduced in two parts: (1) calculation of the Jacobian matrix; (2) training process design. 


12.3.1 Calculation of the Jacobian Matrix 


In this section j and k are used as the indices of neurons ranging from 1 to nn, where nn is the number 
of neurons contained in a topology; i is the index of neuron inputs ranging from 1 to ni, where ni is the 
number of inputs and it may vary for different neurons. 

As an introduction to the basic concepts of neural network training, let us consider a neuron j with ni
inputs, as shown in Figure 12.2. If neuron j is in the first layer, all its inputs would be connected to
the inputs of the network; otherwise, its inputs can be connected to outputs of other neurons or to
network inputs if connections across layers are allowed.

Node y is an important and flexible concept. It can be $y_{j,i}$, meaning the ith input of neuron j. It also
can be used as $y_j$ to define the output of neuron j. In the following derivation, if node y has one index,
then it is used as a neuron output node, but if it has two indices (neuron and input), it is a neuron
input node.


The output node of neuron j is calculated using

$$y_j = f_j(\mathrm{net}_j) \tag{12.27}$$

where $f_j$ is the activation function of neuron j and the net value $\mathrm{net}_j$ is the sum of weighted input
nodes of neuron j:

$$\mathrm{net}_j = \sum_{i=1}^{ni} w_{j,i}\, y_{j,i} + w_{j,0} \tag{12.28}$$

where
$y_{j,i}$ is the ith input node of neuron j, weighted by $w_{j,i}$
$w_{j,0}$ is the bias weight of neuron j


FIGURE 12.2 Connection of a neuron j with the rest of the network. Nodes $y_{j,i}$ could represent network inputs or
outputs of other neurons. $F_{m,j}(y_j)$ is the nonlinear relationship between the neuron output node $y_j$ and the
network output $o_m$.
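
Using Equation 12.28, one may notice that the derivative of $\mathrm{net}_j$ is

$$\frac{\partial\,\mathrm{net}_j}{\partial w_{j,i}} = y_{j,i} \tag{12.29}$$

and the slope $s_j$ of the activation function $f_j$ is

$$s_j = \frac{\partial y_j}{\partial\,\mathrm{net}_j} = \frac{\partial f_j(\mathrm{net}_j)}{\partial\,\mathrm{net}_j} \tag{12.30}$$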


Between the output node $y_j$ of a hidden neuron j and the network output $o_m$, there is a complex nonlinear
relationship (Figure 12.2):

$$o_m = F_{m,j}(y_j) \tag{12.31}$$

where $o_m$ is the mth output of the network.

The complexity of this nonlinear function $F_{m,j}(y_j)$ depends on how many other neurons are between
neuron j and network output m. If neuron j is at network output m, then $o_m = y_j$ and $F'_{m,j}(y_j) = 1$, where
$F'_{m,j}$ is the derivative of the nonlinear relationship between neuron j and output m.

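The elements of the Jacobian matrix in Equation 12.16 can be calculated as

$$\frac{\partial e_{p,m}}{\partial w_{j,i}} = \frac{\partial\left(d_{p,m} - o_{p,m}\right)}{\partial w_{j,i}} = -\frac{\partial o_{p,m}}{\partial w_{j,i}} = -\frac{\partial o_{p,m}}{\partial y_j}\,\frac{\partial y_j}{\partial\,\mathrm{net}_j}\,\frac{\partial\,\mathrm{net}_j}{\partial w_{j,i}} \tag{12.32}$$

Combining with Equations 12.29 through 12.31, Equation 12.32 can be rewritten as

$$\frac{\partial e_{p,m}}{\partial w_{j,i}} = -F'_{m,j}\, s_j\, y_{j,i} \tag{12.33}$$

where $F'_{m,j}$ is the derivative of the nonlinear function between neuron j and output m.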

The computation process for the Jacobian matrix can be organized according to the traditional
backpropagation computation in first-order algorithms (like the EBP algorithm). But there are also
differences between them. First of all, for every pattern, only one backpropagation process is needed
in the EBP algorithm, while in the Levenberg-Marquardt algorithm the backpropagation process has to
be repeated for every output separately in order to obtain consecutive rows of the Jacobian matrix
(Equation 12.16). Another difference is that the concept of backpropagation of the δ parameter [N89] has to
be modified. In the EBP algorithm, output errors are parts of the δ parameter:

$$\delta_j = s_j \sum_{m=1}^{M} F'_{m,j}\, e_m \tag{12.34}$$

In the Levenberg-Marquardt algorithm, the δ parameters are calculated for each neuron j and each
output m separately. Also, in the backpropagation process, the error is replaced by a unit value [TM94]:

$$\delta_{m,j} = s_j\, F'_{m,j} \tag{12.35}$$

By combining Equations 12.33 and 12.35, elements of the Jacobian matrix can be calculated by

$$\frac{\partial e_{p,m}}{\partial w_{j,i}} = -\delta_{m,j}\, y_{j,i} \tag{12.36}$$
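
For the simplest possible case, a single output neuron, Equation 12.36 can be checked against a finite-difference estimate. The sketch below is an illustration only: it assumes a tanh activation (the chapter does not fix one), made-up weights and inputs, and a hypothetical perturbation size h. Because the neuron is itself the network output, the derivative F' equals 1 and δ reduces to the slope.

```python
import math

# One output neuron: o = tanh(net), net = sum(w_i * y_i) + w_0.
# Jacobian elements: de/dw_i = -delta * y_i, with delta equal to the slope here.

def forward(weights, bias, inputs):
    net = sum(w * y for w, y in zip(weights, inputs)) + bias
    out = math.tanh(net)
    slope = 1.0 - out * out          # derivative of tanh at net
    return out, slope

def jacobian_row(weights, bias, inputs):
    """Derivatives of e = d - o with respect to [w_1, ..., w_ni, w_0]."""
    _, slope = forward(weights, bias, inputs)
    delta = slope
    return [-delta * y for y in inputs] + [-delta * 1.0]   # bias input is 1

weights, bias, inputs, desired = [0.3, -0.2], 0.1, [1.0, -1.0], 1.0
row = jacobian_row(weights, bias, inputs)

# Central finite difference on the first weight (assumed step h).
h = 1e-6
e_plus = desired - forward([weights[0] + h, weights[1]], bias, inputs)[0]
e_minus = desired - forward([weights[0] - h, weights[1]], bias, inputs)[0]
print(row[0], (e_plus - e_minus) / (2 * h))
```

Note that the desired output does not appear in `jacobian_row`: the Jacobian depends only on the forward signals and slopes, which is why the error can be replaced by a unit value during backpropagation.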

FIGURE 12.3 Three-layer multilayer perceptron network: the number of inputs is $n_i$, the number of outputs is $n_o$,
and $n_1$ and $n_2$ are the numbers of neurons in the first and second layers, respectively.


There are two unknowns in Equation 12.36 for the Jacobian matrix computation. The input node, $y_{j,i}$,
can be calculated in the forward computation (signal propagating from inputs to outputs), while $\delta_{m,j}$ is
obtained in the backward computation, which is organized as errors backpropagating from output
neurons (output layer) to network inputs (input layer). At output neuron m (j = m), $\delta_{m,j} = s_m$.

For better interpretation of forward computation and backward computation, let us consider the 
three-layer multilayer perceptron network (Figure 12.3) as an example. 

For a given pattern, the forward computation can be organized in the following steps:

a. Calculate net values, slopes, and outputs for all neurons in the first layer:

$$\mathrm{net}_j^{1} = \sum_{i=1}^{n_i} I_i\, w_{j,i}^{1} + w_{j,0}^{1} \tag{12.37}$$

$$y_j^{1} = f_j^{1}\left(\mathrm{net}_j^{1}\right) \tag{12.38}$$

$$s_j^{1} = \frac{\partial f_j^{1}}{\partial\,\mathrm{net}_j^{1}} \tag{12.39}$$

where
$I_i$ are the network inputs
the superscript “1” means the first layer
j is the index of neurons in the first layer

b. Use the outputs of the first layer neurons as the inputs of all neurons in the second layer, and do a
similar calculation for net values, slopes, and outputs:

$$\mathrm{net}_j^{2} = \sum_{i=1}^{n_1} y_i^{1}\, w_{j,i}^{2} + w_{j,0}^{2} \tag{12.40}$$

$$y_j^{2} = f_j^{2}\left(\mathrm{net}_j^{2}\right) \tag{12.41}$$

$$s_j^{2} = \frac{\partial f_j^{2}}{\partial\,\mathrm{net}_j^{2}} \tag{12.42}$$

c. Use the outputs of the second layer neurons as the inputs of all neurons in the output layer (third
layer), and do a similar calculation for net values, slopes, and outputs:

$$\mathrm{net}_j^{3} = \sum_{i=1}^{n_2} y_i^{2}\, w_{j,i}^{3} + w_{j,0}^{3} \tag{12.43}$$

$$o_j = f_j^{3}\left(\mathrm{net}_j^{3}\right) \tag{12.44}$$

$$s_j^{3} = \frac{\partial f_j^{3}}{\partial\,\mathrm{net}_j^{3}} \tag{12.45}$$

After the forward calculation, node array y and slope array s can be obtained for all neurons with
the given pattern.

With the results from the forward computation, for a given output j, the backward computation
can be organized as follows:

d. Calculate the error at output j and initialize δ as the slope of output j:

$$e_j = d_j - o_j \tag{12.46}$$

$$\delta_{j,j}^{3} = s_j^{3} \tag{12.47}$$

$$\delta_{j,k}^{3} = 0 \tag{12.48}$$

where
$d_j$ is the desired output at output j
$o_j$ is the actual output at output j obtained in the forward computation
$\delta_{j,j}^{3}$ is the self-backpropagation
$\delta_{j,k}^{3}$ is the backpropagation from other neurons in the same layer (output layer)

e. Backpropagate δ from the inputs of the third layer to the outputs of the second layer:

$$\delta_{j,k}^{2} = w_{j,k}^{3}\, \delta_{j,j}^{3} \tag{12.49}$$

where k is the index of neurons in the second layer, from 1 to $n_2$.

f. Backpropagate δ from the outputs of the second layer to the inputs of the second layer:

$$\delta_{j,k}^{2} = \delta_{j,k}^{2}\, s_k^{2} \tag{12.50}$$

where k is the index of neurons in the second layer, from 1 to $n_2$.

g. Backpropagate δ from the inputs of the second layer to the outputs of the first layer:

$$\delta_{j,k}^{1} = \sum_{i=1}^{n_2} w_{i,k}^{2}\, \delta_{j,i}^{2} \tag{12.51}$$

where k is the index of neurons in the first layer, from 1 to $n_1$.

h. Backpropagate δ from the outputs of the first layer to the inputs of the first layer:

$$\delta_{j,k}^{1} = \delta_{j,k}^{1}\, s_k^{1} \tag{12.52}$$

where k is the index of neurons in the first layer, from 1 to $n_1$.


For the backpropagation process of other outputs, steps (d)-(h) are repeated.

By performing the forward computation and backward computation, the whole δ array and y array
can be obtained for the given pattern. Then the related row elements ($n_o$ rows) of the Jacobian matrix
can be calculated by using Equation 12.36.

For other patterns, by repeating the forward and backward computation, the whole Jacobian matrix 
can be calculated. 

The pseudo code of the forward computation and backward computation for the Jacobian matrix in 
the Levenberg-Marquardt algorithm is shown in Figure 12.4. 


12.3.2 Training Process Design 


With the update rule of the Levenberg-Marquardt algorithm (Equation 12.25) and the computation of
the Jacobian matrix, the next step is to organize the training process.

According to the update rule, if the error goes down, which means it is smaller than the last error,
it implies that the quadratic approximation of the total error function is working and the combination
coefficient μ can be decreased to reduce the influence of the gradient descent part (ready to speed
up). On the other hand, if the error goes up, which means it is larger than the last error, it is
necessary to follow the gradient more closely to look for a proper curvature for the quadratic
approximation, and the combination coefficient μ is increased.


for all patterns
    % Forward computation
    for all layers
        for all neurons in the layer
            calculate net;      % Equation (12.28)
            calculate output;   % Equation (12.27)
            calculate slope;    % Equation (12.30)
        end;
    end;
    % Backward computation
    initial delta as slope;
    for all outputs
        calculate error;
        for all layers
            for all neurons in the previous layer
                for all neurons in the current layer
                    multiply delta through weights
                    sum the backpropagated delta at proper nodes
                end;
                multiply delta by slope;
            end;
        end;
    end;
end;


FIGURE 12.4 Pseudo code of forward computation and backward computation implementing the
Levenberg-Marquardt algorithm.
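
The forward and backward computations can also be written as a short runnable sketch. The code below is one possible layer-by-layer arrangement under stated assumptions, not the authors' software: every neuron uses a tanh activation, each neuron stores its bias weight last, and the names `forward` and `jacobian_rows` as well as the example weights are hypothetical. It returns one Jacobian row per network output for a single pattern, using a unit value in place of the error during backpropagation.

```python
import math

def forward(layers, inputs):
    """Forward computation: node values and slopes for every layer.

    layers[l][j] holds the weights of neuron j in layer l, bias weight last.
    """
    nodes, slopes = [list(inputs)], []
    for layer in layers:
        outs, slps = [], []
        for w in layer:
            net = sum(wi * yi for wi, yi in zip(w[:-1], nodes[-1])) + w[-1]
            y = math.tanh(net)
            outs.append(y)
            slps.append(1.0 - y * y)          # slope of tanh at net
        nodes.append(outs)
        slopes.append(slps)
    return nodes, slopes

def jacobian_rows(layers, inputs):
    """One Jacobian row per network output for a single pattern (e = d - o)."""
    nodes, slopes = forward(layers, inputs)
    last = len(layers) - 1
    rows = []
    for m in range(len(layers[last])):
        # Initial delta: slope at output m, zero at the other output neurons.
        deltas = [[slopes[last][j] if j == m else 0.0 for j in range(len(layers[last]))]]
        for l in range(last, 0, -1):
            # Multiply delta through the weights, sum at each node of layer l - 1,
            # then multiply by the slope of that neuron.
            prev = []
            for k in range(len(layers[l - 1])):
                back = sum(layers[l][j][k] * deltas[0][j] for j in range(len(layers[l])))
                prev.append(back * slopes[l - 1][k])
            deltas.insert(0, prev)
        row = []
        for l, layer in enumerate(layers):
            for j in range(len(layer)):
                # Equation 12.36: de/dw = -delta * y; the bias input is 1.
                row.extend(-deltas[l][j] * y for y in nodes[l] + [1.0])
        rows.append(row)
    return rows

# Hypothetical three-layer network (2 inputs, 2 + 2 hidden neurons, 1 output).
layers = [
    [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]],
    [[0.7, -0.6, 0.2], [-0.3, 0.4, 0.1]],
    [[1.0, -1.0, 0.05]],
]
rows = jacobian_rows(layers, [1.0, -1.0])
print(len(rows), len(rows[0]))  # 1 row (one output), 15 columns (one per weight)
```

Repeating the call for every training pattern and stacking the returned rows gives the whole Jacobian matrix of Equation 12.16.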


© 2011 by Taylor and Francis Group, LLC 




FIGURE 12.5 Block diagram for training using the Levenberg-Marquardt algorithm: $w_k$ is the current weight,
$w_{k+1}$ is the next weight, $E_{k+1}$ is the current total error, and $E_k$ is the last total error. (Block labels
recoverable from the diagram: Jacobian matrix computation; weight update by Equation 12.25; error
evaluation; μ = μ × 10 and restore $w_k$; $w_k = w_{k+1}$.)


Therefore, the training process using the Levenberg-Marquardt algorithm could be designed as follows:

i. With the initial weights (randomly generated), evaluate the total error (SSE).
ii. Do an update as directed by Equation 12.25 to adjust the weights.
iii. With the new weights, evaluate the total error.
iv. If the current total error has increased as a result of the update, then retract the step (that is,
reset the weight vector to the previous value) and increase the combination coefficient μ by a factor
of 10 or by some other factor. Then go to step ii and try an update again.
v. If the current total error has decreased as a result of the update, then accept the step (that is,
keep the new weight vector as the current one) and decrease the combination coefficient μ by a factor
of 10 or by the same factor as in step iv.
vi. Go to step ii with the new weights until the current total error is smaller than the required value.

The flowchart of the above procedure is shown in Figure 12.5.
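
Steps i through vi can be sketched in a few lines of code. The example below is an illustration under simplifying assumptions rather than the implementation used for the experiments: the "network" is a single tanh neuron with one weight and one bias, the training set is generated from made-up target weights, and the initial μ, the factor of 10, the error threshold, and the update limit are hypothetical defaults.

```python
import math

def errors_and_jacobian(w, patterns):
    """Errors e = d - o and Jacobian rows for one neuron o = tanh(w[0] * x + w[1])."""
    err, jac = [], []
    for x, d in patterns:
        o = math.tanh(w[0] * x + w[1])
        slope = 1.0 - o * o
        err.append(d - o)
        jac.append([-slope * x, -slope])      # Equation 12.36 with delta = slope
    return err, jac

def sse(err):
    return 0.5 * sum(e * e for e in err)

def lm_update(w, err, jac, mu):
    """Equation 12.25 for two weights, solved with Cramer's rule."""
    g0 = sum(r[0] * e for r, e in zip(jac, err))
    g1 = sum(r[1] * e for r, e in zip(jac, err))
    a00 = sum(r[0] * r[0] for r in jac) + mu
    a01 = sum(r[0] * r[1] for r in jac)
    a11 = sum(r[1] * r[1] for r in jac) + mu
    det = a00 * a11 - a01 * a01
    return [w[0] - (g0 * a11 - a01 * g1) / det, w[1] - (a00 * g1 - a01 * g0) / det]

def train(patterns, w, mu=0.01, factor=10.0, max_error=1e-10, max_updates=200):
    err, jac = errors_and_jacobian(w, patterns)                     # step i
    history = [sse(err)]
    for _ in range(max_updates):
        if history[-1] <= max_error:                                # step vi
            break
        w_new = lm_update(w, err, jac, mu)                          # step ii
        err_new, jac_new = errors_and_jacobian(w_new, patterns)     # step iii
        if sse(err_new) < history[-1]:                              # step v: accept
            w, err, jac = w_new, err_new, jac_new
            history.append(sse(err_new))
            mu /= factor
        else:                                                       # step iv: retract
            mu *= factor
    return w, history

# Hypothetical training set generated by the weights (1.5, -0.5).
patterns = [(x, math.tanh(1.5 * x - 0.5)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
weights, history = train(patterns, [0.1, 0.0])
print(weights, history[-1])
```

Because a step is kept only when the error decreases, the recorded error history is strictly decreasing by construction; rejected steps only change μ.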


12.4 Comparison of Algorithms 


In order to illustrate the advantage of the Levenberg-Marquardt algorithm,
let us use the parity-3 problem (see Figure 12.6) as an example and make a
comparison among the EBP algorithm, the Gauss-Newton algorithm, and
the Levenberg-Marquardt algorithm [WCKD07].

A multilayer perceptron network with three neurons (Figure 12.7) is used
for training, and the required training error is 0.01. In order to compare
the convergence rate, for each algorithm, 100 trials are tested with randomly
generated weights (between -1 and 1).

Inputs     | Outputs
-1 -1 -1   | -1
-1 -1  1   |  1
-1  1 -1   |  1
-1  1  1   | -1
 1 -1 -1   |  1
 1 -1  1   | -1
 1  1 -1   | -1
 1  1  1   |  1


FIGURE 12.6 Training patterns of the parity-3 problem.
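
The training patterns of Figure 12.6, and those of larger parity-N problems, can be generated programmatically. The helper below is a small sketch with a hypothetical name, assuming the bipolar (-1/1) coding shown in the figure, where the output is 1 when an odd number of inputs equal 1.

```python
from itertools import product

def parity_patterns(n):
    """Bipolar parity-N patterns: output is 1 for an odd number of 1 inputs, else -1."""
    patterns = []
    for inputs in product((-1, 1), repeat=n):
        ones = sum(1 for v in inputs if v == 1)
        patterns.append((list(inputs), 1 if ones % 2 == 1 else -1))
    return patterns

for inputs, output in parity_patterns(3):
    print(inputs, output)
```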

FIGURE 12.7 Three neurons in multilayer perceptron network.


FIGURE 12.8 Training results of the parity-3 problem (horizontal axis: iteration): (a) the EBP algorithm (α = 1),
(b) the EBP algorithm (α = 100), (c) the Gauss-Newton algorithm, and (d) the Levenberg-Marquardt algorithm.

TABLE 12.2 Comparison among Different Algorithms for Parity-3 Problem

Algorithms | Convergence Rate (%) | Average Iteration | Average Time (ms)
EBP algorithm (α = 1) | 100 | 1646.52 | 320.6
EBP algorithm (α = 100) | 79 | 171.48 | 36.5
Gauss-Newton algorithm | 3 | 4.33 | 1.2
Levenberg-Marquardt algorithm | 100 | 6.18 | 1.6


The training results are shown in Figure 12.8 and the comparison is presented in Table 12.2. One
may notice that: (1) for the EBP algorithm, the larger the learning constant α is, the faster and less
stable the training process will be; (2) the Levenberg-Marquardt algorithm is much faster than the EBP
algorithm and more stable than the Gauss-Newton algorithm.

For more complex parity-N problems, the Gauss-Newton method cannot converge at all, and the
EBP algorithm also becomes less efficient at finding the solution, while the Levenberg-Marquardt
algorithm may lead to successful solutions.


12.5 Summary 


The Levenberg-Marquardt algorithm solves the problems existing in both the gradient descent method
and the Gauss-Newton method for neural-network training by combining those two algorithms.
It is regarded as one of the most efficient training algorithms [TM94].

However, the Levenberg-Marquardt algorithm has its flaws. One problem is that the Hessian matrix
inversion needs to be calculated each time for weight updating, and there may be several updates in
each iteration. For training small networks, the computation is efficient, but for large networks,
such as those in image recognition problems, this inversion calculation becomes very costly and the speed
gained by the second-order approximation may be totally lost. In that case, the Levenberg-Marquardt
algorithm may be even slower than the steepest descent algorithm. Another problem is that the Jacobian
matrix has to be stored for computation, and its size is P × M × N elements (P × M rows by N columns),
where P is the number of patterns, M is the number of outputs, and N is the number of weights. For
large training sets, the memory cost for the Jacobian matrix storage may be too large to be practical.
Also, only very recently has the Levenberg-Marquardt algorithm been implemented for networks other
than MLP (multilayer perceptron) networks [WCKD08,WH10,WHY10,ISIE10].
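
The memory argument is easy to quantify. The helper below is a back-of-the-envelope sketch with a hypothetical name, assuming a dense Jacobian stored with 8 bytes per element; the two problem sizes are made-up examples, not measurements from the chapter.

```python
def jacobian_memory_bytes(num_patterns, num_outputs, num_weights, bytes_per_element=8):
    """Storage for a dense Jacobian with P * M rows and N columns."""
    return num_patterns * num_outputs * num_weights * bytes_per_element

# Parity-3 with an assumed 10-weight network: 8 patterns, 1 output.
print(jacobian_memory_bytes(8, 1, 10))             # 640 bytes
# A hypothetical larger problem: 60000 patterns, 10 outputs, 100000 weights.
print(jacobian_memory_bytes(60000, 10, 100000))    # 480000000000 bytes (about 480 GB)
```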

Even though there are still some problems not solved for the Levenberg-Marquardt training, for 
small- and medium-sized networks and patterns, the Levenberg-Marquardt algorithm is remarkably 
efficient and strongly recommended for neural network training. 


Since the development of the EBP (error backpropagation) algorithm for training neural networks, many
attempts have been made to improve the learning process. There are some well-known methods, such as
momentum or variable learning rate, and there are less known methods that significantly accelerate
the learning rate [WT93,AW95,WCM99,W09,WH10,PE10,BUD09,WB01]. The recently developed NBN
(neuron-by-neuron) algorithm [WCHK07,WCKD08,YW09,WHY10] is very efficient for neural network training.
Compared with the well-known Levenberg-Marquardt (LM) algorithm (introduced in Chapter 12)
[L44,M63], the NBN algorithm has several advantages: (1) the ability to handle arbitrarily connected
neural networks; (2) forward-only computation (without the backpropagation process); and (3) the direct
computation of the quasi-Hessian matrix (no need to compute and store the Jacobian matrix). This chapter is
organized around the three advantages of the NBN algorithm.


13.2 Computational Fundamentals 


Before the derivation, let us introduce some commonly used indices in this chapter: 


• p is the index of patterns, from 1 to np, where np is the number of patterns.
• m is the index of outputs, from 1 to no, where no is the number of outputs.
• j and k are the indices of neurons, from 1 to nn, where nn is the number of neurons.
• i is the index of neuron inputs, from 1 to ni, where ni is the number of inputs and it may vary
for different neurons.


Other indices will be explained in related places. 
The sum square error (SSE) E is defined to evaluate the training process. For all patterns and outputs, it
is calculated by

$$E = \frac{1}{2}\sum_{p=1}^{np}\sum_{m=1}^{no} e_{p,m}^{2} \tag{13.1}$$

where $e_{p,m}$ is the error at output m defined as

$$e_{p,m} = o_{p,m} - d_{p,m} \tag{13.2}$$

where $d_{p,m}$ and $o_{p,m}$ are the desired output and actual output, respectively, at network output m for training
pattern p.

In all algorithms, besides the NBN algorithm, the same computations are being repeated for one 
pattern at a time. Therefore, in order to simplify notations, the index p for patterns will be skipped in 
following derivations, unless it is essential. 


13.2.1 Definition of Basic Concepts in Neural Network Training 


Let us consider neuron j with ni inputs, as shown in Figure 13.1. If neuron j is in the first layer, all its 
inputs would be connected to the inputs of the network; otherwise, its inputs can be connected to out- 
puts of other neurons or to network inputs if connections across layers are allowed. 

Node y is an important and flexible concept. It can be $y_{j,i}$, meaning the ith input of neuron j. It also
can be used as $y_j$ to define the output of neuron j. In this chapter, if node y has one index (neuron),
then it is used as a neuron output node, while if it has two indices (neuron and input), it is a neuron
input node.


FIGURE 13.1 Connection of a neuron j with the rest of the network. Nodes $y_{j,i}$ could represent network inputs or
outputs of other neurons. $F_{m,j}(y_j)$ is the nonlinear relationship between the neuron output node $y_j$ and the
network output $o_m$.

The output node of neuron j is calculated using

$$y_j = f_j(\mathrm{net}_j) \tag{13.3}$$

where
$f_j$ is the activation function of neuron j
the net value $\mathrm{net}_j$ is the sum of weighted input nodes of neuron j:

$$\mathrm{net}_j = \sum_{i=1}^{ni} w_{j,i}\, y_{j,i} + w_{j,0} \tag{13.4}$$

where
$y_{j,i}$ is the ith input node of neuron j, weighted by $w_{j,i}$
$w_{j,0}$ is the bias weight of neuron j

Using (13.4), one may notice that the derivative of $\mathrm{net}_j$ is

$$\frac{\partial\,\mathrm{net}_j}{\partial w_{j,i}} = y_{j,i} \tag{13.5}$$

and the slope $s_j$ of the activation function $f_j$ is

$$s_j = \frac{\partial y_j}{\partial\,\mathrm{net}_j} = \frac{\partial f_j(\mathrm{net}_j)}{\partial\,\mathrm{net}_j} \tag{13.6}$$

Between the output node $y_j$ of a hidden neuron j and the network output $o_m$, there is a complex nonlinear
relationship (Figure 13.1):

$$o_m = F_{m,j}(y_j) \tag{13.7}$$

where $o_m$ is the mth output of the network.

The complexity of this nonlinear function $F_{m,j}(y_j)$ depends on how many other neurons are between
neuron j and network output m. If neuron j is at network output m, then $o_m = y_j$ and $F'_{m,j}(y_j) = 1$, where
$F'_{m,j}$ is the derivative of the nonlinear relationship between neuron j and output m.


13.2.2 Jacobian Matrix Computation


The update rule of the LM algorithm is [TM94]

$$\mathbf{w}_{n+1} = \mathbf{w}_n - \left(\mathbf{J}_n^{T}\mathbf{J}_n + \mu\mathbf{I}\right)^{-1}\mathbf{J}_n^{T}\mathbf{e}_n \tag{13.8}$$

where
n is the index of iterations
μ is the combination coefficient
I is the identity matrix
J is the Jacobian matrix (Figure 13.2)

From Figure 13.2, one may notice that, for every pattern p, there are $no$ rows of the Jacobian matrix,
where $no$ is the number of network outputs. The number of columns is equal to the number of weights in
the network, and the number of rows is equal to np × no.
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FIGURE 13.2 Structure of the Jacobian matrix: (1) the number of columns is equal to the number of weights and 


(2) each row corresponds to a specified training pattern p and output m. 


The elements of the Jacobian matrix can be calculated by

$$\frac{\partial e_m}{\partial w_{j,i}} = \frac{\partial e_m}{\partial y_j}\,\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{j,i}} \tag{13.9}$$

By combining with (13.2), (13.5), (13.6), and (13.7), (13.9) can be written as

$$\frac{\partial e_m}{\partial w_{j,i}} = y_{j,i}\, s_j\, F'_{m,j} \tag{13.10}$$

In second-order algorithms, the parameter $\delta$ [N89,TM94] is defined to measure the EBP process, as

$$\delta_{m,j} = s_j\, F'_{m,j} \tag{13.11}$$

By combining (13.10) and (13.11), elements of the Jacobian matrix can be calculated by

$$\frac{\partial e_m}{\partial w_{j,i}} = y_{j,i}\, \delta_{m,j} \tag{13.12}$$

Using (13.12), in the backpropagation process, the error can be replaced by a unit value “1.”

13.3 Training Arbitrarily Connected Neural Networks


The NBN algorithm introduced in this chapter is developed for training arbitrarily connected neural 
networks using the LM update rule. Instead of layer-by-layer computation (introduced in Chapter 12), 
the NBN algorithm does the forward and backward computation based on NBN routings [WCHK07], 
which makes it suitable for arbitrarily connected neural networks. 


13.3.1 Importance of Training Arbitrarily Connected Neural Networks 


The traditional implementation of the LM algorithm [TM94], such as the one adopted in the MATLAB® Neural Network Toolbox (MNNT), was developed only for standard multilayer perceptron (MLP) networks. It turns out, however, that MLP networks are not efficient.

Figure 13.3 shows the smallest structures that solve the parity-7 problem. The standard MLP network with one hidden layer (Figure 13.3a) needs at least eight neurons to find the solution. The BMLP (bridged multilayer perceptron) network (Figure 13.3b) can solve the problem with four neurons. The FCC (fully connected cascade) network (Figure 13.3c) is the most powerful one, and it requires only three neurons to find the solution. One may notice that the last two types of networks are better choices for efficient training, but they also require more challenging computation.


FIGURE 13.3 Smallest structures for solving parity-7 problem: (a) standard MLP network (64 weights), (b) BMLP network (35 weights), and (c) FCC network (27 weights).

FIGURE 13.4 Five neurons in arbitrarily connected network.


13.3.2 Creation of Jacobian Matrix for Arbitrarily Connected 
Neural Networks 


In this section, the NBN algorithm for calculating the Jacobian matrix for arbitrarily connected feedforward neural networks is presented. The rest of the computations for weight updating follow the LM algorithm, as shown in Equation 13.8.

In the forward computation, neurons are organized according to the direction of signal propagation, 
while in the backward computation, the analysis will follow the backpropagation procedures. 

Let us consider the arbitrarily connected network with one output, as shown in Figure 13.4. 

For the network in Figure 13.4, using the NBN algorithm, the network topology can be described in a way similar to the SPICE program:

n1 [model] 3 1 2
n2 [model] 4 1 2
n3 [model] 5 3 4
n4 [model] 6 1 2 4 5
n5 [model] 7 3 5 6

Notice that each line corresponds to one neuron. The first part (n1 to n5) is the neuron name (Figure 13.4). The second part, “[model],” is the neuron model, such as bipolar, unipolar, or linear. Models are declared in separate lines where the types of activation functions and the neuron gains are specified. The digits after the neuron model are network node indices: the first is the output node of the neuron, followed by its input nodes.

Please notice that neurons must be ordered from network inputs to network outputs. For each given neuron, the neuron inputs must have smaller indices than its output.
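The topology listing for Figure 13.4 maps naturally onto a small data structure. The sketch below is only one possible representation: the tuple layout, the function names, and the choice of the "bipolar" model for every neuron are assumptions made for illustration.

```python
# Each entry: (neuron name, model, output node, input nodes).
# Nodes 1 and 2 are the network inputs; the "bipolar" model is an assumed placeholder.
NETLIST = [
    ("n1", "bipolar", 3, [1, 2]),
    ("n2", "bipolar", 4, [1, 2]),
    ("n3", "bipolar", 5, [3, 4]),
    ("n4", "bipolar", 6, [1, 2, 4, 5]),
    ("n5", "bipolar", 7, [3, 5, 6]),
]


def check_feedforward_order(netlist, num_inputs):
    """Return True if every neuron reads only nodes that are already available."""
    defined = set(range(1, num_inputs + 1))  # network input nodes
    for _name, _model, out_node, in_nodes in netlist:
        if any(node not in defined for node in in_nodes):
            return False  # an input node has not been computed yet
        if any(node >= out_node for node in in_nodes):
            return False  # inputs must have smaller indices than the output
        defined.add(out_node)
    return True


def count_weights(netlist):
    """One weight per input node plus one bias weight per neuron."""
    return sum(len(in_nodes) + 1 for _n, _m, _o, in_nodes in netlist)
```

For this listing, `check_feedforward_order(NETLIST, 2)` accepts the ordering, and `count_weights(NETLIST)` gives the number of columns of the Jacobian matrix for this network.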

The row elements of the Jacobian matrix for a given pattern are computed in the following three steps [WCKD08]:


1. Forward computation 
2. Backward computation 
3. Jacobian element computation 


13.3.2.1 Forward Computation 


In the forward computation, the neurons connected to the network inputs are processed first so that their outputs can be used as inputs to the subsequent neurons. The following neurons are then processed as their input values become available. In other words, the selected computing sequence has to follow the concept of feedforward signal propagation. If a signal reaches the inputs of several neurons at the same time, then these neurons can be processed in any sequence. In the example in Figure 13.4, there are two possible ways in which neurons can be processed in the forward direction: $n_1 n_2 n_3 n_4 n_5$ or $n_2 n_1 n_3 n_4 n_5$. The two procedures lead to different computing processes but to exactly the same results. When the forward pass is concluded, the following two temporary vectors are stored: the first vector $\mathbf{y}$ with the values of the signals on the neuron output nodes and the second vector $\mathbf{s}$ with the values of the slopes of the neuron activation functions, which are signal dependent.
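The forward pass for the network of Figure 13.4 can be sketched as below. The tanh activation, the weight layout (bias weight first), and the numeric values are assumptions chosen for the example. The sketch also illustrates that the two admissible processing orders give the same outputs.

```python
import math

# (output node, input nodes) for neurons n1..n5 of Figure 13.4
NEURONS = [(3, [1, 2]), (4, [1, 2]), (5, [3, 4]), (6, [1, 2, 4, 5]), (7, [3, 5, 6])]


def forward(neurons, weights, inputs):
    """Process neurons in the listed order; return node values, outputs y, slopes s.

    weights[k] = [bias weight, weight of 1st input node, weight of 2nd input node, ...]
    """
    node = {i + 1: x for i, x in enumerate(inputs)}  # network input nodes
    y, s = [], []
    for (out_node, in_nodes), w in zip(neurons, weights):
        net = w[0] + sum(wi * node[n] for wi, n in zip(w[1:], in_nodes))
        out = math.tanh(net)          # assumed bipolar activation
        node[out_node] = out
        y.append(out)
        s.append(1.0 - out * out)     # slope of tanh at this operating point
    return node, y, s


# Arbitrary illustration weights: a different bias per neuron, 0.2 on every input.
weights = [[0.1 * (k + 1)] + [0.2] * len(ins) for k, (_out, ins) in enumerate(NEURONS)]
_, y_a, s_a = forward(NEURONS, weights, [1.0, -1.0])

# Second admissible order: n2 n1 n3 n4 n5
order = [1, 0, 2, 3, 4]
_, y_b, _ = forward([NEURONS[k] for k in order], [weights[k] for k in order], [1.0, -1.0])
```

Here `y_b` holds the same values as `y_a`, only with the first two entries swapped to match the processing order.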


13.3.2.2 Backward Computation 


The sequence of the backward computation is opposite to the forward computation sequence. The process starts with the last neuron and continues toward the input. In the case of the network in Figure 13.4, the following are two possible sequences (backpropagation paths): $n_5 n_4 n_3 n_2 n_1$ or $n_5 n_4 n_3 n_1 n_2$, and they give the same results. To demonstrate the case, let us use the $n_5 n_4 n_3 n_2 n_1$ sequence. The vector $\boldsymbol{\delta}$ represents signal propagation from a network output to the inputs of all other neurons. The size of this vector is equal to the number of neurons.

For the output neuron $n_5$, its sensitivity is initialized using its slope, $\delta_{5,5} = s_5$. For neuron $n_4$, the delta at $n_5$ is propagated by $w_{4,5}$ (the weight between $n_4$ and $n_5$) and then by the slope of neuron $n_4$, so the delta parameter of $n_4$ is $\delta_{5,4} = \delta_{5,5} w_{4,5} s_4$. For neuron $n_3$, the delta parameters of $n_4$ and $n_5$ are propagated to the output of neuron $n_3$ and summed, then multiplied by the slope of neuron $n_3$, as $\delta_{5,3} = (\delta_{5,5} w_{3,5} + \delta_{5,4} w_{3,4}) s_3$. By the same procedure, one obtains $\delta_{5,2} = (\delta_{5,3} w_{2,3} + \delta_{5,4} w_{2,4}) s_2$ and $\delta_{5,1} = (\delta_{5,3} w_{1,3} + \delta_{5,5} w_{1,5}) s_1$. After the backpropagation process is done at neuron $n_1$, all the elements of array $\boldsymbol{\delta}$ are obtained.


13.3.2.3 Jacobian Element Computation 


After the forward and backward computation, all the neuron outputs y and vector 6 are calculated. Then 
using Equation 13.12, the Jacobian row for a given pattern can be obtained. 

By applying all training patterns, the whole Jacobian matrix can be calculated and stored. 
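The three steps can be put together for the single-output network of Figure 13.4. The following is a sketch under stated assumptions rather than the authors' implementation: tanh activations, bias weight stored first, arbitrary weight values, and an error defined as output minus desired value, so that the Jacobian row equals the derivative of the network output with respect to each weight (with the opposite sign convention the row changes sign).

```python
import math

# (output node, input nodes) for neurons n1..n5 of Figure 13.4
NEURONS = [(3, [1, 2]), (4, [1, 2]), (5, [3, 4]), (6, [1, 2, 4, 5]), (7, [3, 5, 6])]


def forward(weights, inputs):
    """Forward computation: node values and slopes for one pattern."""
    node = {i + 1: x for i, x in enumerate(inputs)}
    slopes = []
    for (out_node, in_nodes), w in zip(NEURONS, weights):
        net = w[0] + sum(wi * node[n] for wi, n in zip(w[1:], in_nodes))
        node[out_node] = math.tanh(net)
        slopes.append(1.0 - node[out_node] ** 2)
    return node, slopes


def jacobian_row(weights, inputs):
    """Jacobian row for one pattern and the single output (node 7)."""
    node, slopes = forward(weights, inputs)
    # Backward computation: gain[n] sums the signal arriving at node n from the output.
    gain = {out_node: 0.0 for out_node, _ in NEURONS}
    gain[NEURONS[-1][0]] = 1.0  # the error is replaced by a unit value
    delta = [0.0] * len(NEURONS)
    for k in range(len(NEURONS) - 1, -1, -1):
        out_node, in_nodes = NEURONS[k]
        delta[k] = gain[out_node] * slopes[k]
        for wi, n in zip(weights[k][1:], in_nodes):
            if n in gain:  # network inputs receive no backpropagated signal
                gain[n] += delta[k] * wi
    # Jacobian element computation, Eq. (13.12): input node value times delta
    row = []
    for k, (_out, in_nodes) in enumerate(NEURONS):
        row.append(delta[k] * 1.0)  # the bias weight sees a constant input of 1
        row.extend(delta[k] * node[n] for n in in_nodes)
    return row


# Arbitrary illustration values
weights = [[0.1, 0.3, -0.2], [-0.1, 0.2, 0.4], [0.05, -0.3, 0.25],
           [0.2, 0.1, -0.1, 0.3, -0.4], [-0.15, 0.35, -0.25, 0.45]]
x = [1.0, -1.0]
row = jacobian_row(weights, x)
```

A convenient sanity check for such a routine is to perturb each weight slightly and compare the finite-difference change of the output with the corresponding row element.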

For arbitrarily connected neural networks, the NBN algorithm for the Jacobian matrix computation 
can be organized as shown in Figure 13.5. 


for all patterns (np)
    % Forward computation
    for all neurons (nn)
        for all weights of the neuron (nx)
            calculate net;                    % Eq. (13.4)
        end;
        calculate neuron output;              % Eq. (13.3)
        calculate neuron slope;               % Eq. (13.6)
    end;
    for all outputs (no)
        calculate error;                      % Eq. (13.2)
        % Backward computation
        initial delta as slope;
        for all neurons starting from output neurons (nn)
            for the weights connected to other neurons (ny)
                multiply delta through weights
                sum the backpropagated delta at proper nodes
            end;
            multiply delta by slope (for hidden neurons);
        end;
        related Jacobian row computation;     % Eq. (13.12)
    end;
end;

FIGURE 13.5 Pseudo code using NBN algorithm for Jacobian matrix computation.

13.3.3 Solve Problems with Arbitrarily Connected Neural Networks


13.3.3.1 Function Approximation Problem 


Function approximation is commonly used in neural-network-based nonlinear control, for example, for control surface prediction. In order to approximate the function shown below, 25 points are selected from 0 to 4 as the training patterns. With only four neurons in an FCC network (as shown in Figure 13.6), the training result is presented in Figure 13.7.

$$z = 4\exp\left(-0.15(x-4)^2 - 0.5(y-3)^2\right) + 10^{-9} \tag{13.13}$$


13.3.3.2 Two-Spiral Problem 


The two-spiral problem is considered a good benchmark for both training algorithms and network architectures [AS99]. Depending on the neural network architecture, different numbers of neurons are required for successful training. For example, using standard MLP networks with one hidden layer, 34 neurons are required for the two-spiral problem [PLI08], while with the FCC architecture it can be solved with only eight neurons using the NBN algorithm. The NBN algorithm is not only much faster but can also train reduced-size networks that cannot be handled by the traditional EBP algorithm (see Table 13.1).

For the EBP algorithm, the learning constant is 0.005 (the largest possible to avoid oscillation) and the momentum is 0.5; the maximum number of iterations is 1,000,000 for the EBP algorithm and 1000 for the LM algorithm; the desired error is 0.01; all neurons are in FCC networks; there are 100 trials for each case.


FIGURE 13.6 Network used for training the function approximation problem; notice the output neuron is a 
linear neuron with gain = 1. 


FIGURE 13.7 Averaged SSE between desired surface (a) and neural prediction (b) is 0.0025.

TABLE 13.1 Training Results of Two-Spiral Problem

            Success Rate (%)     Average Number of Iterations     Average Time (s)
Neurons     EBP       NBN        EBP            NBN               EBP         NBN
8           0         13         Failing        287.7             Failing     0.88
9           0         24         Failing        261.4             Failing     0.98
10          0         40         Failing        243.9             Failing     1.57
11          0         69         Failing        231.8             Failing     1.62
12          63        80         410,254        175.1             633.91      1.70
13          85        89         335,531        159.7             620.30      2.09
14          92        92         266,237        137.3             605.32      2.40


13.4 Forward-Only Computation 


The NBN procedure introduced in Section 13.3 requires both forward and backward computation. In particular, as shown in Figure 13.5, one may notice that for networks with multiple outputs, the backpropagation process has to be repeated for each output.

In this section, an improved NBN computation is introduced to overcome this problem by removing the backpropagation process from the computation of the Jacobian matrix.


13.4.1 Derivation 


The concept of $\delta_{m,j}$ was described in Section 13.2. One may notice that $\delta_{m,j}$ can also be interpreted as a signal gain between the net input of neuron $j$ and the network output $m$. Let us extend this concept to gain coefficients between all neurons in the network (Figures 13.8 and 13.10). The notation $\delta_{k,j}$ is an extension of Equation 13.11 and can be interpreted as the signal gain between neurons $j$ and $k$; it is given by

$$\delta_{k,j} = \frac{\partial F_{k,j}(y_j)}{\partial net_j} = \frac{\partial F_{k,j}(y_j)}{\partial y_j}\,\frac{\partial y_j}{\partial net_j} = F'_{k,j}\, s_j \tag{13.14}$$

where
$k$ and $j$ are indices of neurons
$F_{k,j}(y_j)$ is the nonlinear relationship between the output node of neuron $k$ and the output node of neuron $j$

FIGURE 13.8 Interpretation of $\delta_{k,j}$ as a signal gain, where in a feedforward network neuron $j$ must be located before neuron $k$.

Naturally, in feedforward networks, $k \ge j$. If $k = j$, then $\delta_{k,k} = s_k$, where $s_k$ is the slope of the activation function calculated by Equation 13.6. Figure 13.8 illustrates this extended concept of the $\delta_{k,j}$ parameter as a signal gain.

The matrix $\boldsymbol{\delta}$ has a triangular shape, and its elements can be calculated in the forward-only process. Later, elements of the Jacobian can be obtained using Equation 13.12, where only the last rows of matrix $\boldsymbol{\delta}$ associated with network outputs are used. The key issue of the proposed algorithm is the method of calculating the $\delta_{k,j}$ parameters in the forward calculation process, which is described in the next part of this section.


13.4.2 Calculation of 6 Matrix for FCC Architectures 


Let us start our analysis with fully connected neural networks (Figure 13.9). Any other architecture 
could be considered as a simplification of fully connected neural networks by eliminating connections 
(setting weights to zero). If the feedforward principle is enforced (no feedback), fully connected neural 
networks must have cascade architectures. 

Slopes of neuron activation functions $s_j$ can also be written in the form of the $\delta$ parameter as $\delta_{j,j} = s_j$. By inspecting Figure 13.10, $\delta$ parameters can be written as follows:

For the first neuron, there is only one $\delta$ parameter

$$\delta_{1,1} = s_1 \tag{13.15}$$

FIGURE 13.10 The $\delta_{k,j}$ parameters for the neural network of Figure 13.9. Input and bias weights are not used in the calculation of gain parameters.

For the second neuron, there are two $\delta$ parameters

$$\begin{aligned}
\delta_{2,2} &= s_2 \\
\delta_{2,1} &= s_2 w_{1,2} s_1
\end{aligned} \tag{13.16}$$

For the third neuron, there are three $\delta$ parameters

$$\begin{aligned}
\delta_{3,3} &= s_3 \\
\delta_{3,2} &= s_3 w_{2,3} s_2 \\
\delta_{3,1} &= s_3 w_{1,3} s_1 + s_3 w_{2,3} s_2 w_{1,2} s_1
\end{aligned} \tag{13.17}$$

One may notice that all $\delta$ parameters for the third neuron can also be expressed as a function of $\delta$ parameters calculated for previous neurons. Equations 13.17 can be rewritten as

$$\begin{aligned}
\delta_{3,3} &= s_3 \\
\delta_{3,2} &= \delta_{3,3} w_{2,3} \delta_{2,2} \\
\delta_{3,1} &= \delta_{3,3} w_{1,3} \delta_{1,1} + \delta_{3,3} w_{2,3} \delta_{2,1}
\end{aligned} \tag{13.18}$$

For the fourth neuron, there are four $\delta$ parameters

$$\begin{aligned}
\delta_{4,4} &= s_4 \\
\delta_{4,3} &= \delta_{4,4} w_{3,4} \delta_{3,3} \\
\delta_{4,2} &= \delta_{4,4} w_{2,4} \delta_{2,2} + \delta_{4,4} w_{3,4} \delta_{3,2} \\
\delta_{4,1} &= \delta_{4,4} w_{1,4} \delta_{1,1} + \delta_{4,4} w_{2,4} \delta_{2,1} + \delta_{4,4} w_{3,4} \delta_{3,1}
\end{aligned} \tag{13.19}$$

The last parameter $\delta_{4,1}$ can also be expressed in a compact form by summing all terms connected to other neurons (from 1 to 3)

$$\delta_{4,1} = \delta_{4,4} \sum_{i=1}^{3} w_{i,4}\, \delta_{i,1} \tag{13.20}$$

The universal formula to calculate $\delta_{k,j}$ parameters using already calculated data for previous neurons is

$$\delta_{k,j} = \delta_{k,k} \sum_{i=j}^{k-1} w_{i,k}\, \delta_{i,j} \tag{13.21}$$

where in a feedforward network, neuron $j$ must be located before neuron $k$, so $k \ge j$; $\delta_{k,k} = s_k$ is the slope of the activation function of neuron $k$; $w_{j,k}$ is the weight between neuron $j$ and neuron $k$; and $\delta_{k,j}$ is a signal gain through weight $w_{j,k}$ and through the other part of the network connected to $w_{j,k}$.
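Equation 13.21 translates directly into a double loop over neurons. The sketch below assumes a dense weight table between neurons (zero where no connection exists) and zero-based indices; the function name `gain_matrix` and the numeric values are illustrative only.

```python
def gain_matrix(slopes, w):
    """Lower-triangular delta matrix computed with Eq. (13.21).

    slopes[k] : slope s of neuron k (0-based index, feedforward order)
    w[i][k]   : weight from the output of neuron i to neuron k (i < k), 0.0 if absent
    """
    nn = len(slopes)
    delta = [[0.0] * nn for _ in range(nn)]
    for k in range(nn):                      # neurons in feedforward order
        delta[k][k] = slopes[k]              # main diagonal: delta_kk = s_k
        for j in range(k):                   # gains from earlier neurons j to neuron k
            total = sum(w[i][k] * delta[i][j] for i in range(j, k))
            delta[k][j] = delta[k][k] * total
    return delta


# Three-neuron cascade with arbitrary slopes and inter-neuron weights
s = [0.5, 0.8, 0.9]
w = [[0.0, 0.3, -0.2],
     [0.0, 0.0, 0.7],
     [0.0, 0.0, 0.0]]
delta = gain_matrix(s, w)
```

For three neurons the result can be compared term by term with Equations 13.16 and 13.17; for example, `delta[2][0]` should equal $s_3 w_{1,3} s_1 + s_3 w_{2,3} s_2 w_{1,2} s_1$.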

In order to organize the process, the $nn \times nn$ computation table is used for calculating signal gains between neurons, where $nn$ is the number of neurons (Figure 13.11). Natural indices (from 1 to $nn$) are given for each neuron according to the direction of signal propagation. For signal gain computation, only connections between neurons need to be considered, while the weights connected to network inputs and the biasing weights of all neurons are used only at the end of the process. For a given pattern, a sample of the $nn \times nn$ computation table is shown in Figure 13.11. One may notice that the indices of rows and columns are the same as the indices of neurons. In the following derivation, let us use $k$ and $j$ as neuron indices to specify the rows and columns in the computation table. In a feedforward network, $k \ge j$ and matrix $\boldsymbol{\delta}$ has a triangular shape.

$$\begin{array}{c|cccccccc}
\text{Neuron index} & 1 & 2 & \cdots & j & \cdots & k & \cdots & nn \\ \hline
1 & \delta_{1,1} & w_{1,2} & \cdots & w_{1,j} & \cdots & w_{1,k} & \cdots & w_{1,nn} \\
2 & \delta_{2,1} & \delta_{2,2} & \cdots & w_{2,j} & \cdots & w_{2,k} & \cdots & w_{2,nn} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
j & \delta_{j,1} & \delta_{j,2} & \cdots & \delta_{j,j} & \cdots & w_{j,k} & \cdots & w_{j,nn} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
k & \delta_{k,1} & \delta_{k,2} & \cdots & \delta_{k,j} & \cdots & \delta_{k,k} & \cdots & w_{k,nn} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
nn & \delta_{nn,1} & \delta_{nn,2} & \cdots & \delta_{nn,j} & \cdots & \delta_{nn,k} & \cdots & \delta_{nn,nn}
\end{array}$$

FIGURE 13.11 The $nn \times nn$ computation table; gain matrix $\boldsymbol{\delta}$ contains all the signal gains between neurons; weight array $\mathbf{w}$ presents only the connections between neurons, while network input weights and biasing weights are not included.

The computation table consists of three parts: weights between neurons in the upper triangle, the vector of slopes of activation functions on the main diagonal, and the signal gain matrix $\boldsymbol{\delta}$ in the lower triangle. Only the main diagonal and lower triangular elements are computed for each pattern. Initially, the elements on the main diagonal, $\delta_{k,k} = s_k$, are known as the slopes of activation functions, and the values of signal gains $\delta_{k,j}$ are computed subsequently using Equation 13.21.

The computation proceeds neuron by neuron (NBN), starting with the neuron closest to the network inputs. First, row number one is calculated, and then the elements of the subsequent rows. The calculation of each row uses the elements of the rows above it, following Equation 13.21. After completion of the forward computation process, all elements of the $\boldsymbol{\delta}$ matrix in the form of the lower triangle are obtained.

In the next step, elements of the Jacobian matrix are calculated using Equation 13.12. In the case of neural networks with one output, only the last row of the $\boldsymbol{\delta}$ matrix is needed for the gradient vector and Jacobian matrix computation. If a network has $no$ outputs, then the last $no$ rows of the $\boldsymbol{\delta}$ matrix are used. For example, if the network shown in Figure 13.9 has three outputs, the following elements of the $\boldsymbol{\delta}$ matrix are used

$$\begin{bmatrix}
\delta_{2,1} & \delta_{2,2} = s_2 & \delta_{2,3} = 0 & \delta_{2,4} = 0 \\
\delta_{3,1} & \delta_{3,2} & \delta_{3,3} = s_3 & \delta_{3,4} = 0 \\
\delta_{4,1} & \delta_{4,2} & \delta_{4,3} & \delta_{4,4} = s_4
\end{bmatrix} \tag{13.22}$$

and then for each pattern, the three rows of the Jacobian matrix, corresponding to three outputs, are calculated in one step using Equation 13.12 without additional propagation of $\delta$

$$\begin{array}{cccc}
\delta_{2,1} \times \{\mathbf{y}_1\} & s_2 \times \{\mathbf{y}_2\} & 0 \times \{\mathbf{y}_3\} & 0 \times \{\mathbf{y}_4\} \\
\delta_{3,1} \times \{\mathbf{y}_1\} & \delta_{3,2} \times \{\mathbf{y}_2\} & s_3 \times \{\mathbf{y}_3\} & 0 \times \{\mathbf{y}_4\} \\
\delta_{4,1} \times \{\mathbf{y}_1\} & \delta_{4,2} \times \{\mathbf{y}_2\} & \delta_{4,3} \times \{\mathbf{y}_3\} & s_4 \times \{\mathbf{y}_4\} \\
\text{neuron 1} & \text{neuron 2} & \text{neuron 3} & \text{neuron 4}
\end{array} \tag{13.23}$$

where the neurons' input vectors $\mathbf{y}_1$ through $\mathbf{y}_4$ have 6, 7, 8, and 9 elements, respectively (Figure 13.9), corresponding to the number of weights connected. Therefore, each row of the Jacobian matrix has 6 + 7 + 8 + 9 = 30 elements. If the network has three outputs, then from six elements of the $\boldsymbol{\delta}$ matrix and three slopes, 90 elements of the Jacobian matrix are calculated. One may notice that the size of the newly introduced $\boldsymbol{\delta}$ matrix is relatively small, and it is negligible in comparison with other matrices used in the calculation.

The improved NBN procedure gives all the information needed to calculate the Jacobian matrix (13.12) without the backpropagation process; instead, the $\delta$ parameters are obtained in a relatively simple forward computation (see Equation 13.21).


13.4.3 Training Arbitrarily Connected Neural Networks 


The proposed computation above was derived for fully connected neural networks. If the network is not fully connected, then some elements of the computation table are zero. Figure 13.12 shows computation tables for different neural network topologies with six neurons each. Please notice that the zero elements correspond to neurons that are not connected (neurons in the same layer). This can further simplify the computation process for popular MLP topologies (Figure 13.12b).

FIGURE 13.12 Three different architectures with six neurons: (a) FCC network, (b) MLP network, and (c) arbitrarily connected neural network.

for all patterns (np)
    % Forward computation
    for all neurons (nn)
        for all weights of the neuron (nx)
            calculate net;                                   % Eq. (13.4)
        end;
        calculate neuron output;                             % Eq. (13.3)
        calculate neuron slope;                              % Eq. (13.6)
        set current slope as delta;
        for weights connected to previous neurons (ny)
            for previous neurons (nz)
                multiply delta through weights then sum;     % Eq. (13.24)
            end;
            multiply the sum by the slope;                   % Eq. (13.25)
        end;
        related Jacobian elements computation;               % Eq. (13.12)
    end;
    for all outputs (no)
        calculate error;                                     % Eq. (13.2)
    end;
end;

FIGURE 13.13 Pseudo code of the forward-only computation, in second-order algorithms.

Most neural networks used in practice have many zero elements in the computation table (Figure 13.12). In order to reduce the storage requirements (weights with zero values are not stored) and to reduce the computation (operations on zero elements are not performed), a part of the NBN algorithm in Section 13.3 was adopted for forward computation.

In order to further simplify the computation process, Equation 13.21 is completed in two steps

$$x_{k,j} = \sum_{i=j}^{k-1} w_{i,k}\, \delta_{i,j} \tag{13.24}$$

and

$$\delta_{k,j} = \delta_{k,k}\, x_{k,j} = s_k\, x_{k,j} \tag{13.25}$$


The complete algorithm with forward-only computation is shown in Figure 13.13. By adding two addi- 
tional steps using Equations 13.24 and 13.25 (highlighted in bold in Figure 13.13), all computations can 
be completed in the forward-only computing process. 
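The two-step form lends itself to a sparse implementation in which only existing connections are visited. The following sketch applies it to the inter-neuron connections of Figure 13.4; the dictionary layout, the function name `forward_only_gains`, and the numeric slopes and weights are assumptions for illustration.

```python
def forward_only_gains(slopes, links):
    """Return delta[k][j] for j <= k, touching only connections that exist.

    slopes[k] : slope of neuron k (keys are 1-based neuron indices)
    links[k]  : dict {i: w_ik} of weights from earlier neurons i into neuron k
    """
    delta = {}
    for k in sorted(slopes):                          # feedforward order
        delta[k] = {k: slopes[k]}                     # delta_kk = s_k
        x = {}                                        # Eq. (13.24), accumulated sparsely
        for i, w_ik in links.get(k, {}).items():
            for j, d_ij in delta[i].items():
                x[j] = x.get(j, 0.0) + w_ik * d_ij
        for j, x_kj in x.items():
            delta[k][j] = slopes[k] * x_kj            # Eq. (13.25)
    return delta


# Inter-neuron connections of Figure 13.4 (weights to network inputs are not needed here)
slopes = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
links = {3: {1: 0.3, 2: -0.2}, 4: {2: 0.4, 3: 0.1}, 5: {1: -0.5, 3: 0.25, 4: 0.15}}
delta = forward_only_gains(slopes, links)
```

The last row, `delta[5]`, should agree with the values $\delta_{5,1}$ through $\delta_{5,5}$ that the backward computation of Section 13.3.2.2 produces for the same slopes and weights, which gives a simple consistency check between the two procedures.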


13.4.4 Experimental Results 


Several problems are presented to test the computing speed of the two NBN algorithms, with and without the backpropagation process.

The time costs of both the backpropagation computation and the forward-only computation are reported separately for the forward part and the backward part.


13.4.4.1 ASCII Codes to Image Conversion 


This problem is to associate 256 ASCII codes with 256 character images, each of which is made up of 7 × 8 pixels (Figure 13.14). So there are 8 bit inputs (inputs of parity-8 problem), 256 patterns, and 56 outputs. In order to solve the problem, an 8-56-56 MLP network with 112 neurons is used to train those patterns using the NBN algorithms. The computation time is presented in Table 13.2.

FIGURE 13.14 The first 90 images of ASCII characters.

TABLE 13.2 Comparison for ASCII Character Recognition Problem

                         Time Cost (ms/Iteration)
Computation Methods      Forward      Backward       Relative Time (%)
Backpropagation          8.24         1,028.74       100
Forward-only             61.13        0.00           5.9

© 2011 by Taylor and Francis Group, LLC


13.4.4.2 Parity-7 Problem 


Parity-N problems aim to associate N-bit binary input data with their parity bits. The parity problem is also considered to be one of the most difficult problems in neural network training, although it has been solved analytically [BDA03].

The parity-7 problem is trained with the NBN algorithms, using the forward-only computation and the traditional computation separately. Two different network structures are used for training: eight neurons in a 7-7-1 MLP network (64 weights) and three neurons in an FCC network (27 weights). The time cost comparison is shown in Table 13.3.
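The training set and the weight counts quoted for the parity-7 experiment can be reproduced with a short sketch. The function names are hypothetical, and the convention that the target is 1 for an odd number of ones is an assumption; the text only speaks of "parity bits."

```python
from itertools import product


def parity_patterns(n_bits):
    """All 2**n_bits input patterns with a parity target (1 if the number of ones is odd)."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n_bits)]


def mlp_weights(layers):
    """Weights, including biases, of an MLP given as [inputs, hidden..., outputs]."""
    return sum((fan_in + 1) * size for fan_in, size in zip(layers, layers[1:]))


def fcc_weights(n_inputs, n_neurons):
    """Weights of a fully connected cascade: each neuron sees all network inputs,
    all earlier neurons, and a bias."""
    return sum(n_inputs + k + 1 for k in range(n_neurons))


patterns = parity_patterns(7)          # 128 patterns for parity-7
mlp_count = mlp_weights([7, 7, 1])     # 7-7-1 MLP network
fcc_count = fcc_weights(7, 3)          # three-neuron FCC network
```

The same `mlp_weights` helper can be used to cross-check the weight counts stated for the other MLP structures in this section, for example the 7-16-7 network of the error correction experiment.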


13.4.4.3 Error Correction Problems 


Error correction is an extension of parity-N problems for multiple parity bits. In Figure 13.15, the left side is the input data, made up of signal bits and their parity bits, while the right side is the related corrected signal bits and parity bits as outputs, so the number of inputs is equal to the number of outputs.

TABLE 13.3 Comparison for Parity-7 Problem

                                     Time Cost (μs/Iteration)
Networks    Computation Methods      Forward      Backward     Relative Time (%)
MLP         Backpropagation          158.57       67.82        100
            Forward-only             229.13       0.00         101.2
FCC         Backpropagation          54.14        31.94        100
            Forward-only             86.30        0.00         100.3

FIGURE 13.15 Using neural networks to solve error correction problem; errors in input data can be corrected by well-trained neural networks.


Two error correction experiments are presented. The first has a 4 bit signal with its 3 parity bits as inputs, 7 outputs, and 128 patterns (16 correct patterns and 112 patterns with errors), using 23 neurons in a 7-16-7 MLP network (247 weights). The second has an 8 bit signal with its 4 parity bits as inputs, 12 outputs, and 3328 patterns (256 correct patterns and 3072 patterns with errors), using 42 neurons in a 12-30-12 MLP network (762 weights). Error patterns with one incorrect bit must be corrected. Both the backpropagation computation and the forward-only computation were performed with the NBN algorithms. The testing results are presented in Table 13.4.


13.4.4.4 Encoders and Decoders 


Experimental results on the 3-to-8 decoder, 8-to-3 encoder, 4-to-16 decoder, and 16-to-4 encoder, using the NBN algorithms, are presented in Table 13.5. For the 3-to-8 decoder and 8-to-3 encoder, 11 neurons are used in a 3-3-8 MLP network (44 weights) and an 8-8-3 MLP network (99 weights), respectively; for the 4-to-16 decoder and 16-to-4 encoder, 20 neurons are used in a 4-4-16 MLP network (100 weights) and a 16-16-4 MLP network (340 weights), respectively.

In the encoder and decoder problems, one may notice that for the same number of neurons, the more outputs the networks have, the more efficiently the forward-only computation works.

From the presented experimental results, one may notice that, for networks with multiple outputs, the forward-only computation is more efficient than the backpropagation computation, while for the single-output case, the forward-only computation is slightly worse.


TABLE 13.4 Comparison for Error Correction Problem

                                         Time Cost (ms/Iteration)
Problems        Computation Methods      Forward      Backward     Relative Time (%)
4 bit signal    Backpropagation          0.43         2.82         100
                Forward-only             1.82         0.00         56
8 bit signal    Backpropagation          40.59        468.14       100
                Forward-only             175.72       0.00         34.5

TABLE 13.5 Comparison for Encoders and Decoders

                                           Time Cost (μs/Iteration)
Problems          Computation Methods      Forward      Backward     Relative Time (%)
3-to-8 decoder    Traditional              10.14        55.37        100
                  Forward-only             27.86        0.00         42.5
8-to-3 encoder    Traditional              7.19         26.97        100
                  Forward-only             29.76        0.00         87.1
4-to-16 decoder   Traditional              40.03        557.51       100
                  Forward-only             177.65       0.00         29.7
16-to-4 encoder   Traditional              83.24        244.20       100
                  Forward-only             211.28       0.00         62.5


13.5 Direct Computation of Quasi-Hessian Matrix 
and Gradient Vector 


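Using Equation 13.8 for weight updating, one may notice that the matrix products $\mathbf{J}^T\mathbf{J}$ and $\mathbf{J}^T\mathbf{e}$ have to be calculated

$$\mathbf{H} \approx \mathbf{Q} = \mathbf{J}^T\mathbf{J} \tag{13.26}$$

$$\mathbf{g} = \mathbf{J}^T\mathbf{e} \tag{13.27}$$

where
matrix $\mathbf{Q}$ is the quasi-Hessian matrix
$\mathbf{g}$ is the gradient vector [YW09]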


Traditionally, the whole Jacobian matrix $\mathbf{J}$ is calculated and stored [TM94] for the subsequent multiplication operations in Equations 13.26 and 13.27. Storing the Jacobian matrix may lead to memory limitations, as described below.

In the NBN algorithm, the quasi-Hessian matrix $\mathbf{Q}$ and gradient vector $\mathbf{g}$ are calculated directly, without Jacobian matrix computation and storage. Therefore, the NBN algorithm can be used for training problems with a basically unlimited number of training patterns.


13.5.1 Memory Limitation in the LM Algorithm 


In the LM algorithm, the Jacobian matrix $\mathbf{J}$ has to be calculated and stored for the Hessian matrix computation [TM94]. In this procedure, as shown in Figure 13.2, at least $np \times no \times nn$ elements (the Jacobian matrix) have to be stored, where $nn$ here denotes the number of weights. For small- and medium-sized training sets, this method may work smoothly. However, the memory cost becomes huge for training with a large number of patterns, since the number of elements of the Jacobian matrix $\mathbf{J}$ is proportional to the number of patterns.

For example, the pattern recognition problem in the MNIST handwritten digit database [CKOZ06] consists of 60,000 training patterns, 784 inputs, and 10 outputs. Using only the simplest possible neural network with 10 neurons (one neuron per output), the memory cost for the entire Jacobian matrix storage is nearly 35 GB. This huge memory requirement cannot be satisfied by any Windows compilers, where there is a 3 GB limitation for single-array storage. Therefore, the LM algorithm cannot be used for problems with a large number of patterns.
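The size of the Jacobian matrix in the MNIST example can be checked with simple arithmetic. The sketch assumes that each of the 10 neurons is connected to all 784 inputs plus a bias weight and that each element is stored in double precision (8 bytes); neither detail is stated explicitly in the text.

```python
def jacobian_elements(n_patterns, n_outputs, n_weights):
    """Number of stored elements: (np x no) rows times nn columns."""
    return n_patterns * n_outputs * n_weights


n_weights = (784 + 1) * 10            # assumed: 784 input weights + 1 bias per neuron
elements = jacobian_elements(60_000, 10, n_weights)

BYTES_PER_ELEMENT = 8                 # assumption: double precision storage
gib = elements * BYTES_PER_ELEMENT / 2**30
print(f"{elements:,} elements, about {gib:.1f} GiB")
```

Under these assumptions the estimate is consistent with the figure of nearly 35 GB quoted above.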
13.5.2 Review of Matrix Algebra


There are two ways to multiply rows and columns of two matrices. If a row of the first matrix is multiplied by a column of the second matrix, the result is a scalar, as shown in Figure 13.16a. If a column of the first matrix is multiplied by a row of the second matrix, the result is a partial matrix $\mathbf{q}$ (Figure 13.16b) [L05]. The number of scalars is $nn \times nn$, while the number of partial matrices $\mathbf{q}$, which later have to be summed, is $np \times no$.

When $\mathbf{J}^T$ is multiplied by $\mathbf{J}$ using the routine shown in Figure 13.16b, partial matrices $\mathbf{q}$ (size: $nn \times nn$) first need to be calculated $np \times no$ times, and then all $np \times no$ matrices $\mathbf{q}$ must be summed together. The routine of Figure 13.16b seems complicated; therefore, almost all matrix multiplication processes use the routine of Figure 13.16a, where only one element of the resulting matrix is calculated and stored each time.

Even though the routine of Figure 13.16b seems more complicated, a detailed analysis (see Table 13.6) shows that the computation time for matrix multiplication is basically the same for both routines.

In the specific case of neural network training, only one row of the Jacobian matrix $\mathbf{J}$ (column of $\mathbf{J}^T$) is known for each training pattern, so if the routine from Figure 13.16b is used, the creation of the quasi-Hessian matrix can start sooner, without the necessity of computing and storing the entire Jacobian matrix for all patterns and all outputs.
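The equivalence of the two multiplication routines can be illustrated with a small sketch. The function names and the example matrix are hypothetical; the point is only that the column-row routine consumes one Jacobian row at a time and never needs the whole matrix.

```python
def row_column_product(jac):
    """Q = J^T J computed element by element (Figure 13.16a); needs the whole J."""
    n = len(jac[0])
    return [[sum(row[a] * row[b] for row in jac) for b in range(n)] for a in range(n)]


def column_row_product(rows):
    """Q accumulated as a sum of partial matrices q (Figure 13.16b), one row at a time."""
    q = None
    for row in rows:                      # rows may come from a generator
        if q is None:
            q = [[0.0] * len(row) for _ in row]
        for a, ra in enumerate(row):
            for b, rb in enumerate(row):
                q[a][b] += ra * rb        # partial matrix q = row^T row
    return q


# A small Jacobian with four rows (pattern/output pairs) and three weights
jac = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0], [-1.0, 0.5, 0.0]]
q_a = row_column_product(jac)
q_b = column_row_product(iter(jac))       # the iterator mimics rows arriving one by one
```

Both routines perform the same multiplications and additions, only in a different order, which matches the operation counts in Table 13.6.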

Table 13.7 roughly estimates the memory cost of the two multiplication methods.

The analytical results in Table 13.7 show that the column-row multiplication (Figure 13.16b) can save a lot of memory.

FIGURE 13.16 Two ways of multiplying matrices: (a) row-column multiplication results in a scalar and (b) column-row multiplication results in a partial matrix $\mathbf{q}$.

TABLE 13.6 Computation Analysis

Multiplication Methods     Addition                    Multiplication
Row-column                 (np × no) × nn × nn         (np × no) × nn × nn
Column-row                 nn × nn × (np × no)         nn × nn × (np × no)

np is the number of training patterns, no is the number of outputs, and nn is the number of weights.

TABLE 13.7 Memory Cost Analysis

Multiplication Methods     Elements for Storage
Row-column                 (np × no) × nn + nn × nn + nn
Column-row                 nn × nn + nn
Difference                 (np × no) × nn


13.5.3 Quasi-Hessian Matrix Computation 


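Let us introduce the quasi-Hessian submatrix $\mathbf{q}_{p,m}$ (size: $nn \times nn$)

$$\mathbf{q}_{p,m} = \begin{bmatrix}
\left(\dfrac{\partial e_{p,m}}{\partial w_1}\right)^2 & \dfrac{\partial e_{p,m}}{\partial w_1}\dfrac{\partial e_{p,m}}{\partial w_2} & \cdots & \dfrac{\partial e_{p,m}}{\partial w_1}\dfrac{\partial e_{p,m}}{\partial w_{nn}} \\
\dfrac{\partial e_{p,m}}{\partial w_2}\dfrac{\partial e_{p,m}}{\partial w_1} & \left(\dfrac{\partial e_{p,m}}{\partial w_2}\right)^2 & \cdots & \dfrac{\partial e_{p,m}}{\partial w_2}\dfrac{\partial e_{p,m}}{\partial w_{nn}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial e_{p,m}}{\partial w_{nn}}\dfrac{\partial e_{p,m}}{\partial w_1} & \dfrac{\partial e_{p,m}}{\partial w_{nn}}\dfrac{\partial e_{p,m}}{\partial w_2} & \cdots & \left(\dfrac{\partial e_{p,m}}{\partial w_{nn}}\right)^2
\end{bmatrix} \tag{13.28}$$

Using the procedure in Figure 13.16b, the $nn \times nn$ quasi-Hessian matrix $\mathbf{Q}$ can be calculated as the sum of submatrices $\mathbf{q}_{p,m}$

$$\mathbf{Q} = \sum_{p=1}^{np} \sum_{m=1}^{no} \mathbf{q}_{p,m} \tag{13.29}$$

By introducing the $1 \times nn$ vector $\mathbf{j}_{p,m}$

$$\mathbf{j}_{p,m} = \begin{bmatrix} \dfrac{\partial e_{p,m}}{\partial w_1} & \dfrac{\partial e_{p,m}}{\partial w_2} & \cdots & \dfrac{\partial e_{p,m}}{\partial w_{nn}} \end{bmatrix} \tag{13.30}$$

submatrices $\mathbf{q}_{p,m}$ in Equation 13.28 can also be written in the vector form (Figure 13.16b)

$$\mathbf{q}_{p,m} = \mathbf{j}_{p,m}^T\, \mathbf{j}_{p,m} \tag{13.31}$$

One may notice that for the computation of submatrices $\mathbf{q}_{p,m}$, only the $nn$ elements of vector $\mathbf{j}_{p,m}$ need to be calculated and stored. All the submatrices can be calculated for each pattern $p$ and output $m$ separately and summed together to obtain the quasi-Hessian matrix $\mathbf{Q}$.

Considering the independence among all patterns and outputs, there is no need to store all the quasi-Hessian submatrices $\mathbf{q}_{p,m}$. Each submatrix can be added to a temporary matrix right after its computation. Therefore, during the direct computation of the quasi-Hessian matrix $\mathbf{Q}$ using (13.29), only memory for $nn$ elements is required, instead of that for the whole Jacobian matrix with $(np \times no) \times nn$ elements (Table 13.7).

From (13.28), one may notice that all the submatrices $\mathbf{q}_{p,m}$ are symmetrical. With this property, only the upper (or lower) triangular elements of those submatrices need to be calculated. Therefore, during the improved computation of the quasi-Hessian matrix $\mathbf{Q}$, the multiplication operations in (13.31) and the sum operations in (13.29) can both be reduced approximately by half.

13.5.4 Gradient Vector Computation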


Gradient subvector n,,,,, (size: nn x 1) is 


pm OC pm 
aw, ow, 
0e pm Oe p.m 
Nom =| OW. P| =| OW. |X pm (13.32) 
0€ p.m 0 p.m 
OWan | | OWnn 


With the procedure in Figure 13.16b, gradient vector g can be calculated as the sum of gradient sub- 
vector Ny in 


g= >> (13.33) 


p=l m=1 
Using the same vector j,,,,, defined in (13.30), gradient subvector can be calculated using 


Nom = Jp.mp.m (13.34) 


Similarly, the gradient subvector $\eta_{p,m}$ can be calculated for each pattern and output separately and
summed to a temporary vector. Since the same vector $j_{p,m}$ is calculated during the quasi-Hessian matrix
computation above, only one extra scalar, $e_{p,m}$, needs to be stored.

With the improved computation, both the quasi-Hessian matrix Q and the gradient vector g can be
computed directly, without storing or multiplying the Jacobian matrix. During the process, only
a temporary vector $j_{p,m}$ with nn elements needs to be stored; in other words, the memory cost for the
Jacobian matrix storage is reduced by a factor of (np × no). In the MNIST problem mentioned in Section
13.5.1, the memory cost for the storage of Jacobian elements could be reduced from more than 35 GB
to nearly 30.7 kB.
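The accumulation described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy sizes, and the random Jacobian rows are assumptions made for illustration only. The sketch adds one row $j_{p,m}$ and one error $e_{p,m}$ at a time into Q and g, using only the upper triangle of each submatrix, and then compares the result with the explicit products $J^T J$ and $J^T e$. The full Jacobian is kept here only so that the two results can be compared.

```python
import random

def accumulate_q_and_g(rows_and_errors, nn):
    """Accumulate Q = sum j^T j and g = sum j^T e one Jacobian row at a time."""
    Q = [[0.0] * nn for _ in range(nn)]
    g = [0.0] * nn
    for j, e in rows_and_errors:          # one (pattern, output) pair at a time
        for a in range(nn):
            g[a] += j[a] * e              # Eqs. 13.33 and 13.34
            for b in range(a, nn):        # upper triangle only (symmetry)
                Q[a][b] += j[a] * j[b]    # Eqs. 13.29 and 13.31
    for a in range(nn):                   # mirror into the lower triangle
        for b in range(a):
            Q[a][b] = Q[b][a]
    return Q, g

def explicit_q_and_g(J, e):
    """Reference result: Q = J^T J and g = J^T e from the stored Jacobian."""
    nn = len(J[0])
    Q = [[sum(row[a] * row[b] for row in J) for b in range(nn)] for a in range(nn)]
    g = [sum(row[a] * err for row, err in zip(J, e)) for a in range(nn)]
    return Q, g

random.seed(0)
np_, no, nn = 6, 2, 4                     # patterns, outputs, weights (toy sizes)
J = [[random.uniform(-1, 1) for _ in range(nn)] for _ in range(np_ * no)]
e = [random.uniform(-1, 1) for _ in range(np_ * no)]

Q_stream, g_stream = accumulate_q_and_g(zip(J, e), nn)
Q_full, g_full = explicit_q_and_g(J, e)
```

In a real trainer the rows would come from the forward and backward computations for each pattern rather than from a stored matrix, so only one row and one error are held at any time.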


13.5.5 Jacobian Row Computation 


The key point of the improved computation above for the quasi-Hessian matrix Q and the gradient vector g
is to calculate the vector $j_{p,m}$ defined in (13.30) for each pattern and output. This vector is
equivalent to one row of the Jacobian matrix J.

By combining Equations 13.12 and 13.30, the elements of vector $j_{p,m}$ can be calculated by

$$j_{p,m} = \left[ \delta_{p,m,1} \left[ y_{p,1,1} \;\cdots\; y_{p,1,i} \;\cdots \right] \;\cdots\; \delta_{p,m,j} \left[ y_{p,j,1} \;\cdots\; y_{p,j,i} \;\cdots \right] \;\cdots \right] \tag{13.35}$$

where $y_{p,j,i}$ is the ith input of neuron j when training pattern p is applied.

Using the NBN procedure introduced in Section 13.3, all elements $y_{p,j,i}$ in Equation 13.35 can
be calculated in the forward computation, while vector δ is obtained in the backward computation;
or, using the improved NBN procedure in Section 13.4, both vectors y and δ can be obtained in the
improved forward computation. Again, since only one vector $j_{p,m}$ needs to be stored for each pattern
and output in the improved computation, the memory cost for all those temporary parameters can be
reduced by a factor of (np × no). All matrix operations are simplified to vector operations.

Generally, for a problem with np patterns and no outputs, the NBN algorithm without Jacobian
matrix storage can be organized as the pseudo code shown in Figure 13.17.


    % Initialization
    Q = 0;
    g = 0;
    % Improved computation
    for p = 1:np                          % Number of patterns
        % Forward computation
        for m = 1:no                      % Number of outputs
            % Backward computation
            calculate vector j_{p,m};     % Eq. (13.35)
            calculate submatrix q_{p,m};  % Eq. (13.31)
            calculate subvector η_{p,m};  % Eq. (13.34)
            Q = Q + q_{p,m};              % Eq. (13.29)
            g = g + η_{p,m};              % Eq. (13.33)
        end;
    end;


FIGURE 13.17 Pseudo code of the improved computation for the quasi-Hessian matrix and gradient vector in
the NBN algorithm.


13.5.6 Comparison on Memory and Time Consumption 


Several experiments are designed to test the memory and time efficiency of the NBN algorithm
compared with the traditional LM algorithm. They are divided into two parts: (1) memory comparison and
(2) time comparison.


13.5.6.1 Memory Comparison 


Three problems, each of which has a huge number of patterns, are selected to test the memory cost of both
the traditional computation and the improved computation. The LM algorithm and the NBN algorithm are
used for training, and the test results are shown in Tables 13.8 and 13.9. In order to make a more precise
comparison, the memory cost of the program code and input files was excluded from the comparison.


TABLE 13.8 Memory Comparison for Parity Problems

Parity-N problems        N = 14         N = 16
Patterns                 16,384         65,536
Structures^a             15 neurons     17 neurons
Jacobian matrix sizes    5,406,720      27,852,800
Weight vector sizes      330            425
Average iteration        99.2           166.4
Success rate (%)         13             9

Algorithms               Actual Memory Cost (Mb)
LM algorithm             79.21          385.22
NBN algorithm            3.41           4.3

^a All neurons are in FCC networks.
TABLE 13.9 Memory Comparison for MNIST Problem

Problem                  MNIST
Patterns                 60,000
Structures               784-1 single-layer network^a
Jacobian matrix sizes    47,100,000
Weight vector sizes      785

Algorithms               Actual Memory Cost (Mb)
LM algorithm             385.68
NBN algorithm            15.67

^a In order to perform efficient matrix inversion during training, only one of ten digits is classified
each time.


TABLE 13.10 Time Comparison for Parity Problems

Parity-N problems     N = 9     N = 11    N = 13     N = 15
Patterns              512       2,048     8,192      32,768
Neurons               10        12        14         16
Weights               145       210       287        376
Average iterations    38.51     59.02     68.08      126.08
Success rate (%)      58        37        24         12

Algorithms            Averaged Training Time (s)
Traditional LM        0.78      68.01     1508.46    43,417.06
Improved LM           0.33      22.09     173.79     2,797.93


From the test results in Tables 13.8 and 13.9, it is clear that memory cost for training is significantly 
reduced in the improved computation. 


13.5.6.2 Time Comparison 


Parity-N problems are used to test the training time of both the traditional computation and the
improved computation using the LM algorithm. The structures used for testing are all FCC networks.
For each problem, the initial weights and training parameters are the same.

From Table 13.10, one may notice that the improved (NBN) computation can not only handle much larger
problems, but also runs much faster than the traditional LM computation, especially when training with
a large number of patterns. The larger the number of patterns, the more time efficient the improved
computation becomes.


13.6 Conclusion 


In this chapter, the NBN algorithm is introduced to overcome the structure and memory limitations of the
LM algorithm. Based on the specially designed NBN routings, the NBN algorithm can be used not only
for traditional MLP networks, but also for other arbitrarily connected neural networks.

The NBN algorithm can be organized in two procedures: with a backpropagation process and without
a backpropagation process. Experimental results show that the former is suitable for networks with
a single output, while the latter is more efficient for networks with multiple outputs.

The NBN algorithm does not need to store or multiply the large Jacobian matrix. As a consequence,
the memory requirement for the quasi-Hessian matrix and gradient vector computation is decreased
by a factor of (P × M), where P is the number of patterns and M is the number of outputs. An additional
benefit of the memory reduction is a significant reduction in computation time. Therefore, the training
speed of the NBN algorithm becomes much faster than that of the traditional LM algorithm [W09,ISIE10].

In the NBN algorithm, the quasi-Hessian matrix can be computed on the fly as training patterns are
applied. This is a particular advantage for applications that require dynamically changing the
number of training patterns: there is no need to repeat the entire multiplication $J^T J$; one only adds to
or subtracts from the quasi-Hessian matrix. The quasi-Hessian matrix can be modified as patterns are
applied or removed.
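As a minimal sketch of this add-or-subtract idea, the snippet below applies and then removes one pattern's contribution to Q and g. The helper name `rank_one_update` and the numeric rows are hypothetical. The sketch also assumes that the rows were evaluated at the same weights, since a row computed at different weights would no longer cancel exactly.

```python
def rank_one_update(Q, g, j, e, sign=+1):
    """Add (sign=+1) or remove (sign=-1) one pattern's contribution in place."""
    nn = len(j)
    for a in range(nn):
        g[a] += sign * j[a] * e
        for b in range(nn):
            Q[a][b] += sign * j[a] * j[b]

nn = 3
Q = [[0.0] * nn for _ in range(nn)]
g = [0.0] * nn
pattern_a = ([1.0, 2.0, -1.0], 0.5)       # (Jacobian row, error); made-up values
pattern_b = ([0.5, -1.0, 3.0], -0.25)

rank_one_update(Q, g, *pattern_a)             # apply pattern a
rank_one_update(Q, g, *pattern_b)             # apply pattern b
rank_one_update(Q, g, *pattern_a, sign=-1)    # remove pattern a again
```

After the three calls, Q and g hold the contribution of pattern b alone, without any product over the full Jacobian having been formed.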

There are two implementations of the NBN algorithm on the website
http://www.eng.auburn.edu/~wilambm/nnt/index.htm. The MATLAB® version can handle arbitrarily connected
networks, but the Jacobian matrix is computed and stored [WCHK07]. In the C++ version [YW09], all new
features of the NBN algorithm mentioned in this chapter are implemented.
14.1 Introduction


Feedforward neural networks such as the multilayer perceptron (MLP) are some of the most popular 
artificial neural network structures being used today. MLPs have been the subject of intensive research 
efforts in recent years because of their interesting learning and generalization capacity and applicability 
to a variety of classification, approximation, modeling, identification, and control problems. 

The classical, the simplest, and the most used method for training an MLP is the steepest descent back- 
propagation algorithm, known also as the standard backpropagation (SBP) algorithm. Unfortunately, 
this algorithm suffers from a number of shortcomings, mainly the slow learning rate. Therefore, many 
researchers have been interested in accelerating the learning with this algorithm, or in proposing new, 
fast learning algorithms. 

Fast learning algorithms constituted an appealing area of research in 1988 and 1989. Since then, there 
have been many attempts to find fast training algorithms, and consequently a number of learning algo- 
rithms have been proposed with significant progress being made on these and related issues. 

In this chapter, a survey of different fast training procedures for neural networks is presented. More than
a decade of progress in accelerating feedforward neural network learning algorithms is reviewed, and
more recent techniques are also discussed. The algorithms and techniques are presented in unified forms
and are discussed with particular emphasis on their behavior, including the reduction in the number of
iterations, their computational complexity, their generalization capacity, and other parameters.
Experimental results on benchmark applications are reported, which allow the performance of the
algorithms to be compared.


14.2 Review of the Multilayer Perceptron 


Many models of neurons have been proposed in the literature [1,2]. The most widely used one is called the
perceptron and is shown in Figure 14.1.

The subscript j denotes the index of the neuron within its layer, while s denotes the index of the
corresponding layer. Based on this model, the MLP is constructed as indicated in Figure 14.2.

Each layer s consists of $n_s$ neurons ($s = 1, \ldots, L$) and $n_{s-1} + 1$ inputs. The first input of each
layer is a bias input (typical values are 0.5 or 1). The first layer, with $n_0 + 1$ inputs, is the input
layer. The Lth layer, with $n_L$ nodes, is the output layer.


MLP rules:
For any neuron j in a layer s, the output signal is defined by

$$y_j^{[s]} = f\left(u_j^{[s]}\right) \tag{14.1}$$


FIGURE 14.1 A model for a single neuron (perceptron) in an MLP. [Diagram labels: linear part, nonlinear part.]
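FIGURE 14.2 Fully connected feedforward MLP.


where $u_j^{[s]}$ is the linear output signal and f is an activation function assumed to be differentiable
over $\mathbb{R}$:

$$u_j^{[s]} = \sum_{i=0}^{n_{s-1}} w_{ji}^{[s]} y_i^{[s-1]} = \left(w_j^{[s]}\right)^T y^{[s-1]} \tag{14.2}$$

where

$$w_j^{[s]} = \left[ w_{j0}^{[s]}, w_{j1}^{[s]}, \ldots, w_{j n_{s-1}}^{[s]} \right]^T \quad \text{for } j = 1, \ldots, n_s;\; s = 1, \ldots, L$$

$$y^{[s]} = \left[ y_0^{[s]}, y_1^{[s]}, \ldots, y_{n_s}^{[s]} \right]^T \quad \text{for } s = 0, \ldots, L-1$$

The terms $y_0^{[s]}$ are the bias terms.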
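The two MLP rules can be sketched in a few lines of Python. This is an illustrative sketch only: the nested-list weight layout (bias weight first), the choice of tanh as the activation f, the bias value of 1, and the hand-picked weights are all assumptions rather than anything prescribed by the chapter.

```python
import math

def forward(weights, x, f=math.tanh, bias=1.0):
    """Propagate input x through the MLP.

    weights[s-1][j-1] holds w_j^[s], with the bias weight w_j0 first.
    Returns the list of layer output vectors y^[0], ..., y^[L].
    """
    y = [bias] + list(x)                  # y^[0]: bias term followed by the inputs
    outputs = [y]
    for layer in weights:
        u = [sum(w * yi for w, yi in zip(w_j, y)) for w_j in layer]   # Eq. 14.2
        y = [bias] + [f(u_j) for u_j in u]                            # Eq. 14.1
        outputs.append(y)
    return outputs

# A 2-2-1 network with hand-picked (hypothetical) weights.
weights = [
    [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]],  # layer 1: two neurons, 2 + 1 inputs each
    [[0.0, 1.0, 1.0]],                    # layer 2: one neuron, 2 + 1 inputs
]
ys = forward(weights, [1.0, 1.0])
network_output = ys[-1][1:]               # drop the bias slot of the last layer
```

The bias slot is kept at the front of every layer output so that the inner product in Equation 14.2 can run over i = 0, ..., n_{s-1} without special cases.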


14.3 Review of the Standard Backpropagation Algorithm 


The backpropagation algorithm is the standard algorithm used for training an MLP. It is a generalized 
least mean square (LMS) algorithm that minimizes a cost function equal to the sum of the squares of 
the error signals between the actual and the desired outputs. 

Let us define the total error function for all output neurons and for the current pattern as

$$E_p = \frac{1}{2} \sum_{j=1}^{n_L} \left(e_j^{[L]}\right)^2 \tag{14.3}$$

where $e_j^{[L]}$ denotes the nonlinear error of the jth output unit, given by

$$e_j^{[L]} = d_j^{[L]} - y_j^{[L]} \tag{14.4}$$

where $d_j^{[L]}$ and $y_j^{[L]}$ are the desired and the current output signals of the corresponding unit in
the output layer, respectively.

In the literature, $E_p$ is called the performance index and is expressed in the more general form [2,40]:

$$E_p = \sum_{j=1}^{n_L} \sigma\left(e_{jp}\right) \tag{14.5}$$

In this case, $\sigma(\cdot)$ is an error output function, typically convex, called the loss or weighting
function. In the case of $\sigma(e_{jp}) = \frac{1}{2} e_{jp}^2$, we retrieve the ordinary $L_2$-norm criterion [2].


The criterion to be minimized is the sum of the errors over all the training patterns:

$$E = \sum_{p=1}^{P} E_p \tag{14.6}$$

The SBP algorithm is derived using the gradient descent method expressed by

$$\Delta w_{ji}^{[s]}(k) = w_{ji}^{[s]}(k) - w_{ji}^{[s]}(k-1) = -\mu \frac{\partial E}{\partial w_{ji}^{[s]}} \tag{14.7}$$

where
μ is a positive real constant called the learning rate.
Note that minimizing E is equivalent to minimizing $E_p$.
In what follows, the iteration index k will be omitted for simplicity.


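In vector notation, we can express Equation 14.7 for the neuron j (i.e., for all synaptic weights of the
jth neuron) in a layer s as follows:

$$\Delta w_j^{[s]} = -\mu \left( \nabla_j^{[s]}(E_p) \right) \tag{14.8}$$

where

$$\nabla_j^{[s]}(E_p) = \left[ \frac{\partial E_p}{\partial w_{j0}^{[s]}}, \frac{\partial E_p}{\partial w_{j1}^{[s]}}, \ldots, \frac{\partial E_p}{\partial w_{j n_{s-1}}^{[s]}} \right]^T \tag{14.9}$$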


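The evaluation of this gradient vector changes according to whether the derivative is computed with
respect to the synaptic weights of the output layer or any of the hidden layers.
For a neuron j in the output layer L,

$$\nabla_j^{[L]}(E_p) = -\left[ f'(u_j^{[L]})\, e_j^{[L]}\, y_0^{[L-1]}, \; f'(u_j^{[L]})\, e_j^{[L]}\, y_1^{[L-1]}, \; \ldots, \; f'(u_j^{[L]})\, e_j^{[L]}\, y_{n_{L-1}}^{[L-1]} \right]^T \tag{14.10}$$

For the entire set of output neurons, the gradient signals vector is given by

$$\nabla^{[L]}(E_p) = \left[ \left(\nabla_1^{[L]}\right)^T, \left(\nabla_2^{[L]}\right)^T, \ldots, \left(\nabla_{n_L}^{[L]}\right)^T \right]^T \tag{14.11}$$

For a neuron j in a hidden layer s, Equation 14.9 becomes

$$\nabla_j^{[s]}(E_p) = -\left[ f'(u_j^{[s]})\, e_j^{[s]}\, y_0^{[s-1]}, \; f'(u_j^{[s]})\, e_j^{[s]}\, y_1^{[s-1]}, \; \ldots, \; f'(u_j^{[s]})\, e_j^{[s]}\, y_{n_{s-1}}^{[s-1]} \right]^T \tag{14.12}$$

where s = (L − 1), ..., 1 and

$$e_j^{[s]} = \sum_{r=1}^{n_{s+1}} f'(u_r^{[s+1]})\, e_r^{[s+1]}\, w_{rj}^{[s+1]} \tag{14.13}$$

$e_j^{[s]}$ is taken to be the estimated error in the hidden layer. Note that these estimated errors are
computed in a backward direction from layer (L − 1) to 1.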
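For all the neurons in the sth layer,

$$\nabla^{[s]}(E_p) = \left[ \left(\nabla_1^{[s]}\right)^T, \left(\nabla_2^{[s]}\right)^T, \ldots, \left(\nabla_{n_s}^{[s]}\right)^T \right]^T \tag{14.14}$$

Let us define the general gradient vector of $E_p$ for all neural network synaptic weights as

$$\nabla E_p = \nabla E_p(W) = \left[ \left(\nabla^{[1]}(E_p)\right)^T, \left(\nabla^{[2]}(E_p)\right)^T, \ldots, \left(\nabla^{[L]}(E_p)\right)^T \right]^T \tag{14.15}$$

where

$$W = \left[ \left(w_1^{[1]}\right)^T, \left(w_2^{[1]}\right)^T, \ldots, \left(w_{n_1}^{[1]}\right)^T, \left(w_1^{[2]}\right)^T, \ldots, \left(w_{n_2}^{[2]}\right)^T, \ldots, \left(w_1^{[L]}\right)^T, \ldots, \left(w_{n_L}^{[L]}\right)^T \right]^T \tag{14.16}$$

The general updating rule for all the synaptic weights in the network becomes

$$\Delta W(k) = -\mu \nabla E_p(W) \tag{14.17}$$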
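The backward computation of the gradient can be checked numerically. The following sketch assumes a tanh activation, a bias input of 1, the error $E_p = \frac{1}{2}\sum_j (d_j - y_j)^2$, and hidden-layer errors propagated as $e_j^{[s]} = \sum_r f'(u_r^{[s+1]}) e_r^{[s+1]} w_{rj}^{[s+1]}$. The network size, weights, and function names are hypothetical. It compares the backpropagated gradient with central finite differences and then applies one steepest descent step.

```python
import math

def f(u):
    return math.tanh(u)

def f_prime(u):
    return 1.0 - math.tanh(u) ** 2

def forward(weights, x):
    """Return layer outputs ys (each with a leading bias of 1) and linear outputs us."""
    ys, us = [[1.0] + list(x)], []
    for layer in weights:
        u = [sum(w * y for w, y in zip(w_j, ys[-1])) for w_j in layer]
        us.append(u)
        ys.append([1.0] + [f(v) for v in u])
    return ys, us

def pattern_error(weights, x, d):
    ys, _ = forward(weights, x)
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, ys[-1][1:]))

def sbp_gradient(weights, x, d):
    """Gradient of E_p for every weight, in the same nested shape as `weights`."""
    ys, us = forward(weights, x)
    e = [dj - yj for dj, yj in zip(d, ys[-1][1:])]        # output-layer errors
    grads = [None] * len(weights)
    for s in range(len(weights) - 1, -1, -1):             # backward over layers
        local = [f_prime(u_j) * e_j for u_j, e_j in zip(us[s], e)]
        grads[s] = [[-l_j * y_i for y_i in ys[s]] for l_j in local]
        if s > 0:
            # Estimated errors of the previous layer (index 0 is the bias weight).
            n_prev = len(weights[s][0]) - 1
            e = [sum(local[r] * weights[s][r][i + 1] for r in range(len(local)))
                 for i in range(n_prev)]
    return grads

def numerical_gradient(weights, x, d, h=1e-6):
    """Central finite differences, used only as a cross-check."""
    grads = []
    for layer in weights:
        g_layer = []
        for w_j in layer:
            g_row = []
            for i in range(len(w_j)):
                old = w_j[i]
                w_j[i] = old + h
                e_plus = pattern_error(weights, x, d)
                w_j[i] = old - h
                e_minus = pattern_error(weights, x, d)
                w_j[i] = old
                g_row.append((e_plus - e_minus) / (2 * h))
            g_layer.append(g_row)
        grads.append(g_layer)
    return grads

weights = [[[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [[0.2, -0.3, 0.7]]]  # 2-2-1 network
x, d = [0.9, -0.9], [0.5]
analytic = sbp_gradient(weights, x, d)
numeric = numerical_gradient(weights, x, d)
max_diff = max(abs(a - n) for ga, gn in zip(analytic, numeric)
               for ra, rn in zip(ga, gn) for a, n in zip(ra, rn))

mu = 0.1                                                   # learning rate
before = pattern_error(weights, x, d)
for layer, g_layer in zip(weights, analytic):
    for w_j, g_j in zip(layer, g_layer):
        for i in range(len(w_j)):
            w_j[i] -= mu * g_j[i]                          # steepest descent step
after = pattern_error(weights, x, d)
```

A small `max_diff` indicates that the backward recursion and the finite-difference estimate agree, and `after` being smaller than `before` shows the effect of a single update on this pattern.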


As mentioned above, the learning of the MLP using the SBP algorithm is plagued by slow convergence. 
Eight different approaches for increasing the convergence speed are summarized as follows. 


14.4 Different Approaches for Increasing the Learning Speed 
14.4.1 Weight Updating Procedure 


We distinguish the online (incremental) method and the batch method. In the first, a pattern (a learning 
example) is presented at the input and then all weights are updated before the next pattern is presented. 
In the batch method, the weight changes Aw are accumulated over some number (usually all) of the 
learning examples before the weights are actually changed. 

Practically, we have found that the convergence speeds of these two methods are similar. 


14.4.2 Principles of Learning 


In the learning phase, in each iteration, patterns can be selected arbitrarily or in a certain order. The 
order of presenting data during the learning phase affects the learning speed most. In general, presenting 
data in a certain order yields slightly faster training. 


14.4.3 Estimation of Optimal Initial Conditions 


In the SBP algorithm, the user always starts with random initial weight values. Finding optimal initial 
weights to start the learning phase can considerably improve the convergence speed. 


14.4.4 Reduction of the Data Size 


In many applications (namely, in signal processing or image processing), one has to deal with data 
of huge dimensions. The use of these data in their initial forms becomes intractable. Preprocessing 
data, for example, by extracting features or by using projection onto a new basis speeds up the 
learning process and simplifies the use of the NN. 
14.4.5 Estimation of the Optimal NN Structure


Usually, the NN structure is determined by a trial-and-error approach. Starting with the optimal NN
structure, i.e., the optimal number of hidden layers and their corresponding numbers of neurons,
would considerably reduce the training time needed. The interested reader can find a chapter on MLP
pruning algorithms used to determine the optimal structure in this handbook. However, it was recently
shown that MLP architectures, as shown in Figure 14.2, are not as powerful as other neural network
architectures such as FCC and BMLP networks [6,13,20].


14.4.6 Use of Adaptive Parameters 


The use of an adaptive slope of the activation function or a global adaptation of the learning rate and/or 
momentum rate can increase the convergence speed in some applications. 


14.4.7 Choice of the Optimization Criterion 


In order to improve the learning speed or the generalization capacity, many other sophisticated
optimization criteria can be used. The standard ($L_2$-norm) least squares criterion is not the only cost
function that can be used for deriving the synaptic weights. Indeed, when signals are corrupted with
non-Gaussian noise, the standard $L_2$-norm cost function performs badly. This will be discussed later.


14.4.8 Application of More Advanced Algorithms 


Numerous heuristic optimization algorithms have been proposed to improve the convergence speed of 
the SBP algorithm. Unfortunately, some of these algorithms are computationally very costly and time 
consuming, i.e., they require a large increase of storage and computational cost, which can become 
unmanageable even for a moderate size of neural network. 

As we will see later, the first five possibilities depend on the learning algorithm, and despite the 
multiple attempts to develop theories that help to find optimal initial weights or initial neural network 
structures etc., there exist neither interesting results nor universal rules or theory allowing this. 

The three remaining possibilities are related to the algorithm itself. The search for new algorithms or 
new optimization functions has made good progress and has yielded good results. 

In spite of the big variations in the proposed algorithms, they fall roughly into two categories. 

The first category involves the development of algorithms based on first-order optimization methods 
(FOOM). This is the case for the SBP algorithm developed above. 

Assume that E(w) is a cost function to minimize with respect to the parameter vector w, and ∇E(w)
is the gradient vector of E(w) with respect to w. The FOOMs are based on the following rule:

$$\Delta w = -\mu \frac{\partial E(w)}{\partial w} = -\mu \nabla E(w) \tag{14.18}$$

which is known in the literature as the steepest descent algorithm or the gradient descent method [1-4].
μ is a positive constant that governs the amplitude of the correction applied to w in each iteration and
thus governs the convergence speed.

The second category involves the development of algorithms based on second-order optimization
methods (SOOM). These also aim to accelerate the learning speed of the MLP. All of these methods are
based on the following rule:

$$\Delta(w) = -\left[ \nabla^2 E(w) \right]^{-1} \nabla E(w) \tag{14.19}$$

where $\nabla^2 E(w)$ is the matrix of second-order derivatives of E(w) with respect to w, known as the
Hessian matrix. This method is known in the literature as Newton's method [21,22]. It is known for its high
convergence speed, but it requires the computation of the Hessian matrix inverse $[\nabla^2 E(w)]^{-1}$. The
dimensions of this matrix grow with the network size, and in practice it is very difficult to find its
exact value. For this reason, a lot of research has focused on finding an approximation to this matrix
(i.e., the Hessian matrix inverse). The most popular approaches have used quasi-Newton methods (i.e., the
conjugate gradient or secant methods). All of them proceed by approximating $[\nabla^2 E(w)]^{-1}$ and are
considered to be more efficient, but their storage and computational requirements grow as the square of
the network size [7]. As mentioned above, these algorithms are faster than those based on FOOM, but
because of their complexity, neural network researchers and users prefer the SBP algorithm for its
simplicity and are very interested in finding a modified version of this algorithm that is faster than
the standard one.
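The difference between the first-order rule and the second-order (Newton) rule can be seen on a toy problem. The sketch below assumes a two-dimensional quadratic cost $E(w) = \frac{1}{2} w^T A w - b^T w$, whose Hessian is simply A; the matrix, the vector b, and the learning rate are made-up values, and the function names are illustrative.

```python
def grad(A, b, w):
    """Gradient of E(w) = 0.5 w^T A w - b^T w for a 2 x 2 matrix A."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def newton_step(A, b, w):
    """One Newton step, with the exact 2 x 2 Hessian A inverted in closed form."""
    g = grad(A, b, w)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
    return [w[0] - (inv[0][0] * g[0] + inv[0][1] * g[1]),
            w[1] - (inv[1][0] * g[0] + inv[1][1] * g[1])]

def steepest_descent(A, b, w, mu, tol=1e-8, max_iter=10000):
    """Repeat w <- w - mu * grad until the gradient norm drops below tol."""
    for k in range(max_iter):
        g = grad(A, b, w)
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 < tol:
            return w, k
        w = [w[0] - mu * g[0], w[1] - mu * g[1]]
    return w, max_iter

A = [[20.0, 0.0], [0.0, 1.0]]     # ill-conditioned quadratic (hypothetical example)
b = [20.0, 1.0]                   # the minimum is at w = [1, 1]
w_newton = newton_step(A, b, [0.0, 0.0])
w_sd, sd_iterations = steepest_descent(A, b, [0.0, 0.0], mu=0.09)
```

On a quadratic, a single Newton step reaches the minimum, whereas steepest descent needs many iterations because the learning rate is limited by the steepest direction. The price is the Hessian inverse, which is trivial here but grows with the square of the number of weights.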


14.5 Different Approaches to Speed Up the SBP Algorithm 


Several parameters in the SBP algorithm can be updated during the learning phase in order to
accelerate convergence. These parameters are the learning coefficient μ, the momentum
term Ω, and even the activation function slope. We summarize below all the suggested ideas in this
regard.


14.5.1 Updating the Learning Rate 


Since about 1988, several authors have been interested in improving the convergence speed of the SBP
algorithm by updating the learning rate [2,3,5,12]. Several rules have been proposed for this purpose.
Some of them are effective, but others do not differ much from the algorithm with a constant step.

We shall review the suggestions for the updating rules of the learning rate μ and/or the
momentum term Ω in the order of their appearance:


• In 1988, two approaches were published, namely, the quickprop (QP) algorithm by Fahlman
[24] and the delta bar delta (DBD) rule by Jacobs [10].
• In the QP algorithm, the activation function is equal to

$$f(u_j) = \frac{1}{1 + e^{-u_j}} + 0.1\, u_j \tag{14.20}$$

where $u_j$ is given by (14.2).
The simplified QP algorithm can be summarized in the following rules [2]:

$$\Delta w_{ji}(k) = \begin{cases} \gamma_{ji}^{(k)} \Delta w_{ji}(k-1), & \text{if } \Delta w_{ji}(k-1) \neq 0 \\[4pt] -\mu_0 \dfrac{\partial E}{\partial w_{ji}}, & \text{if } \Delta w_{ji}(k-1) = 0 \end{cases} \tag{14.21}$$

where

$$\gamma_{ji}^{(k)} = \min\left\{ \frac{\partial E(w^{(k)}) / \partial w_{ji}}{\partial E(w^{(k-1)}) / \partial w_{ji} - \partial E(w^{(k)}) / \partial w_{ji}}, \; \gamma_{\max} \right\} \tag{14.22}$$

and the parameters $\gamma_{\max}$ and $\mu_0$ are typically chosen as $0.01 \le \mu_0 \le 0.6$ and $\gamma_{\max} = 1.75$.
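The DBD algorithm assumes that each synaptic coefficient has its own learning rate. The updating
rule of each learning coefficient $\mu_{ji}^{[s]}$ depends on the sign of the local gradient
$\delta_{ji}^{[s]}(k) = \partial E / \partial w_{ji}^{[s]}$ in each iteration:

$$\mu_{ji}^{[s]}(k) = \begin{cases} \mu_{ji}^{[s]}(k-1) + \alpha & \text{if } \nu_{ji}^{[s]}(k-1)\, \delta_{ji}^{[s]}(k) > 0 \\[4pt] \beta\, \mu_{ji}^{[s]}(k-1) & \text{if } \nu_{ji}^{[s]}(k-1)\, \delta_{ji}^{[s]}(k) < 0 \\[4pt] \mu_{ji}^{[s]}(k-1) & \text{otherwise} \end{cases} \tag{14.23}$$

where
α and β are arbitrary parameters
$\nu_{ji}^{[s]}(k) = (1 - \lambda)\, \delta_{ji}^{[s]}(k) + \lambda\, \nu_{ji}^{[s]}(k-1)$
λ is a positive real number smaller than 1 [2]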


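• In 1990, the Super SAB algorithm was proposed by Tollenaere [25]. It is a slight
modification of the DBD algorithm. Each synaptic weight $w_{ji}$ has its own learning rate such that

$$\Delta w_{ji}(k) = -\mu_{ji}^{(k)} \frac{\partial E}{\partial w_{ji}} + \gamma\, \Delta w_{ji}(k-1) \tag{14.24}$$

where

$$\mu_{ji}^{(k)} = \begin{cases} \alpha\, \mu_{ji}^{(k-1)} & \text{if } \dfrac{\partial E(w^{(k)})}{\partial w_{ji}} \dfrac{\partial E(w^{(k-1)})}{\partial w_{ji}} > 0 \\[4pt] \beta\, \mu_{ji}^{(k-1)} & \text{otherwise} \end{cases} \tag{14.25}$$

and α = 1/β.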


• In 1991, Darken and Moody [8] suggested modifying the learning rate during the
training phase such that

$$\mu(k) = \mu_0 \frac{1}{1 + \dfrac{k}{k_0}} \tag{14.26}$$

or

$$\mu(k) = \mu_0 \frac{1 + \dfrac{c}{\mu_0} \dfrac{k}{k_0}}{1 + \dfrac{c}{\mu_0} \dfrac{k}{k_0} + k_0 \left( \dfrac{k}{k_0} \right)^2} \tag{14.27}$$

where
c and $k_0$ are positive constants
$\mu_0$ is the initial value of the learning coefficient μ

These are called the "search-then-converge strategy" (STCS) algorithms.
The evolution of μ versus time for formula (14.26), for two different values of $k_0$, is plotted in
Figure 14.3.
FIGURE 14.3 Evolution of μ versus time for different values of $k_0$.


Note that at the beginning of the learning phase (the search phase), μ is relatively large. This makes
the algorithm fast at the start. As the training progresses, μ becomes smaller (the
convergence phase). This avoids oscillations at the end of the learning phase and ensures smooth
convergence, but it decreases the speed of convergence.
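The two search-then-converge schedules can be written as small functions. The sketch assumes the forms $\mu(k) = \mu_0 / (1 + k/k_0)$ and $\mu(k) = \mu_0 \,(1 + \frac{c}{\mu_0}\frac{k}{k_0}) / (1 + \frac{c}{\mu_0}\frac{k}{k_0} + k_0 (k/k_0)^2)$; the function names and the parameter values are illustrative only.

```python
def stcs_simple(k, mu0, k0):
    """First schedule: mu(k) = mu0 / (1 + k / k0)."""
    return mu0 / (1.0 + k / k0)

def stcs_extended(k, mu0, k0, c):
    """Second schedule, which behaves like c / k for k much larger than k0."""
    ratio = k / k0
    return mu0 * (1.0 + (c / mu0) * ratio) / (1.0 + (c / mu0) * ratio + k0 * ratio ** 2)

mu0 = 0.5
# Sample both schedules every 25 iterations for a small and a large k0.
schedule_fast = [stcs_simple(k, mu0, k0=10) for k in range(0, 101, 25)]
schedule_slow = [stcs_simple(k, mu0, k0=100) for k in range(0, 101, 25)]
```

With a small $k_0$ the rate leaves the search phase early, while a large $k_0$ keeps μ close to $\mu_0$ for longer before the convergence phase begins.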

• In 1992, the famous conjugate gradient method was developed separately by Johansson [11] and
Battiti [26]. The major advantage of the method lies in the use of the second-order derivative of
the error signal. In the case of a quadratic error surface, the method needs only N iterations for
convergence (N is the number of synaptic weights of the neural network). The algorithm can be
summarized as [11]

$$\Delta w(k) = \mu(k)\, dr(k) \tag{14.28}$$

where dr(k) is the descent direction. Many expressions have been proposed to update the descent
direction. The most widely used formula is given by

$$dr(k) = -\nabla E(w(k)) + \beta(k)\, dr(k-1) \tag{14.29}$$

where
∇E(w(k)) = ∂E/∂w(k)
β(k) is in general computed using the Polak-Ribière formula:

$$\beta(k) = \frac{\left[ \nabla E(w(k)) - \nabla E(w(k-1)) \right]^T \nabla E(w(k))}{\nabla E(w(k-1))^T\, \nabla E(w(k-1))} \tag{14.30}$$
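The claim about N iterations on a quadratic surface can be illustrated with a short sketch. It assumes a quadratic cost $E(w) = \frac{1}{2} w^T A w - b^T w$ with a symmetric positive definite matrix A, and it chooses the step μ(k) by an exact line search, which is only available in closed form for quadratics. The matrix, the vector b, and the helper names are made up for the example.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, w, iterations):
    """Minimize E(w) = 0.5 w^T A w - b^T w with Polak-Ribiere directions."""
    g = [r - bi for r, bi in zip(matvec(A, w), b)]         # gradient of E at w
    d = [-gi for gi in g]                                   # first direction: steepest descent
    for _ in range(iterations):
        if dot(g, g) == 0.0:                                # already at the minimum
            break
        Ad = matvec(A, d)
        mu = -dot(g, d) / dot(d, Ad)                        # exact line search (quadratic only)
        w = [wi + mu * di for wi, di in zip(w, d)]          # weight update along d
        g_new = [r - bi for r, bi in zip(matvec(A, w), b)]
        beta = dot([gn - go for gn, go in zip(g_new, g)], g_new) / dot(g, g)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]   # new descent direction
        g = g_new
    return w

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]    # symmetric positive definite
b = [1.0, 2.0, 3.0]
w_cg = conjugate_gradient(A, b, [0.0, 0.0, 0.0], iterations=3)
residual = [r - bi for r, bi in zip(matvec(A, w_cg), b)]    # gradient at the final point
```

With three weights, three iterations bring the gradient down to rounding error. On a neural network error surface, which is not quadratic, the line search has to be done numerically and more iterations are needed.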


• In 1994, two other approaches were developed: the RPROP algorithm by Riedmiller [27] and the
accelerated backpropagation (ABP) algorithm by Parlos et al. [23].
The RPROP algorithm is characterized by the use of the sign of the gradient instead of its
numeric value in the updating equations:

$$\Delta w_{ji}(k) = -\Delta_{ji}(k)\, \text{sign}\left( \nabla E(w_{ji}(k)) \right) \tag{14.31}$$

where

$$\Delta_{ji}(k) = \begin{cases} 1.2\, \Delta_{ji}(k-1) & \text{if } \nabla E(w_{ji}(k))\, \nabla E(w_{ji}(k-1)) > 0 \\[4pt] 0.5\, \Delta_{ji}(k-1) & \text{if } \nabla E(w_{ji}(k))\, \nabla E(w_{ji}(k-1)) < 0 \\[4pt] \Delta_{ji}(k-1) & \text{otherwise} \end{cases} \tag{14.32}$$


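In the ABP algorithm, the batch update principle is suggested, and weights are changed according to

$$\Delta w_{ji} = -\rho(E)\, \frac{\partial E / \partial w_{ji}}{\left\| \partial E / \partial w \right\|^2} \tag{14.33}$$

where ρ(E) is a function of the error signal. Different expressions were suggested for this, such as

$$\rho(E) = \mu; \quad \rho(E) = \mu E; \quad \text{or} \quad \rho(E) = \mu \tanh\left( \frac{E}{E_0} \right)$$

where μ and $E_0$ are constant, nonnegative numbers. The choice of ρ(E) defines the decay mode in
the search space.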
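The sign-based RPROP update can be sketched as follows. The sketch assumes the usual form of the rule: each weight moves against the sign of its gradient by its own step size, the step is multiplied by 1.2 when two successive gradients agree in sign, by 0.5 when they disagree, and is left unchanged when their product is zero. The separable quadratic test function, the initial step size, and the function names are assumptions made for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop(grad, w, steps, delta0=0.1, inc=1.2, dec=0.5):
    """Sign-based updates with one step size per weight."""
    delta = [delta0] * len(w)
    g_prev = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            product = g[i] * g_prev[i]
            if product > 0:
                delta[i] *= inc          # same sign: accelerate
            elif product < 0:
                delta[i] *= dec          # sign change: the minimum was overshot
            w[i] -= sign(g[i]) * delta[i]
        g_prev = g
    return w

# E(w) = sum_i c_i (w_i - t_i)^2 with very different curvatures c_i.
targets = [3.0, -2.0]
scales = [1000.0, 0.001]

def grad(w):
    return [2 * c * (wi - t) for c, wi, t in zip(scales, w, targets)]

w_final = rprop(grad, [0.0, 0.0], steps=200)
```

Because only the sign of the gradient is used, both coordinates approach their targets at a similar pace even though their gradient magnitudes differ by six orders of magnitude.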

• In 1996, another algorithm called SASS was proposed by Hannan et al. [28]. The updating
equations are the same as for RPROP (14.31), but $\Delta_{ji}(k)$ is given by

$$\Delta_{ji}(k) = \begin{cases} 2\, \Delta_{ji}(k-1) & \text{if } g_{ji}(k)\, g_{ji}(k-1) \ge 0 \text{ and } g_{ji}(k)\, g_{ji}(k-2) \ge 0 \\[4pt] 0.5\, \Delta_{ji}(k-1) & \text{otherwise} \end{cases} \tag{14.34}$$

This approach seems similar to RPROP, but experimental results show that it behaves
differently in several cases.

Referring to the works of Alpsan [29], Hannan [31], and Smagt [30], we can conclude that the
RPROP algorithm is the fastest among the eight described above. In [31], Hannan et al. state that
conclusions about the general performance of the algorithms cannot be drawn. Indeed, generalized
conclusions about algorithm properties and speed can be based neither on a few simulation
results nor on inspection of some statistics.

• In 2006, a new approach was proposed by L. Behera et al. [12] using a Lyapunov function. This
approach seems to be more effective than those mentioned above. The following optimization criterion

$$E = \frac{1}{2} \sum_{j=1}^{n_L} \left( d_j^{[L]} - y_j^{[L]} \right)^2 \tag{14.35}$$

can be expressed as a Lyapunov function such that

$$V_1 = \frac{1}{2}\, \tilde{y}^T \tilde{y} \tag{14.36}$$

where

$$\tilde{y} = \left[ d_1^{[L]} - y_1^{[L]}, \ldots, d_{n_L}^{[L]} - y_{n_L}^{[L]} \right]^T$$

Then, the learning rate is updated during the learning phase according to

$$\eta = \mu\, \frac{\left\| \tilde{y} \right\|^2}{\left\| J^T \tilde{y} \right\|^2} \tag{14.37}$$

where μ is a starting constant to be chosen arbitrarily at the beginning of the learning phase
and J is defined as

$$J = \frac{\partial y^{[L]}}{\partial W} \tag{14.38}$$

and represents the instantaneous value of the Jacobian matrix.
This method has a significantly high convergence speed, owing to the updating of the learning
rate using information provided by the Jacobian matrix.


14.5.2 Updating the Activation Function Slope 


In [9], Kruschke et al. have shown that the speed of the SBP algorithm can be increased by updating
the slope of the activation function, a.
To find the updating rule for the slope, we apply the gradient method with respect to a:

$$\Delta a(k) = -\mu_a \frac{\partial E}{\partial a} \tag{14.39}$$


14.6 Some Simulation Results 


To compare the performance of the different algorithms presented, it is first necessary to define the sen- 
sitivity to the initialization of the synaptic weights and the generalization capacity of a neural network. 


14.6.1 Evaluation of the Sensitivity to the Initialization 
of the Synaptic Weights 


It is known that the convergence of all iterative search methods depends essentially on the starting point
chosen. In our case, these are the initial synaptic coefficients (weights). In the case of poor
initialization, the iterative algorithm may diverge. As we will see below, there is no rule for defining
or choosing a priori the initial starting point or even the range in which it lies. Some algorithms are
very sensitive to the initialization, while others are less so. To study the sensitivity to the
initialization of the weights, we have to test the convergence of the algorithm for a large number of
different, randomly initialized trials (Monte Carlo test). By a trial, we mean one training phase with one
random weight initialization. The ending criterion is the mean squared error over all the output neurons
and all the training patterns:

$$E = \frac{1}{N} \sum_{p=1}^{N} \sum_{j=1}^{n_L} \left( e_{jp}^{[L]} \right)^2 \tag{14.40}$$

where N is the total number of training patterns.
Each training phase is stopped if E reaches a threshold fixed beforehand. This threshold is selected
depending on the application, and it will be denoted in what follows as ending_threshold. Each learning
trial must be stopped if the ending threshold is not reached after a number of iterations fixed a priori.
The choice of this number also depends on the application. This number is denoted by iter_number. The
convergence is assumed to have failed if iter_number is reached before the value of ending_threshold.
The sensitivity to weight initialization is evaluated via the proposed formula:

$$S_w(\%) = 100 \cdot \left( 1 - \frac{\text{Number of convergent trials}}{\text{Total number of trials}} \right) \tag{14.41}$$

Thus, the smaller $S_w$, the less sensitive the algorithm is to weight initialization.


14.6.2 Study of the Generalization Capability 


To study the generalization capability ($G_c$) after the training phase, we should present new patterns
(testing patterns), whose desired outputs are known, to the network and compare the actual neural
outputs with the desired ones. If the norm of the error between these two outputs is smaller than a
threshold chosen beforehand (denoted in what follows as gen_threshold), then the new pattern is assumed
to be recognized successfully.

The generalization capability is evaluated via the following proposed formula:

$$G_c(\%) = 100 \cdot \frac{\text{Recognized pattern number}}{\text{Total testing patterns number}} \tag{14.42}$$

Thus, the larger $G_c$, the better the generalization capability of the network.
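The two evaluation measures can be computed with a few helper functions. The sketch below assumes that a trial converges when the error reaches ending_threshold within iter_number iterations, and that the norm of the output error is the Euclidean norm, which the text does not specify. All function names and sample values are hypothetical.

```python
def sensitivity(convergent_trials, total_trials):
    """S_w in percent: the share of trials that failed to converge."""
    return 100.0 * (1.0 - convergent_trials / total_trials)

def trial_converged(error_history, ending_threshold, iter_number):
    """True if E reaches ending_threshold within the first iter_number iterations."""
    return any(e <= ending_threshold for e in error_history[:iter_number])

def generalization(outputs, desired, gen_threshold):
    """G_c in percent: the share of test patterns with error norm below gen_threshold."""
    recognized = 0
    for y, d in zip(outputs, desired):
        norm = sum((di - yi) ** 2 for di, yi in zip(d, y)) ** 0.5   # Euclidean norm (assumed)
        if norm < gen_threshold:
            recognized += 1
    return 100.0 * recognized / len(desired)

# Made-up example: 3 of 4 trials converge, 3 of 4 test patterns are recognized.
s_w = sensitivity(convergent_trials=3, total_trials=4)
g_c = generalization(outputs=[[0.85], [0.12], [0.5], [0.95]],
                     desired=[[0.9], [0.1], [0.9], [0.9]],
                     gen_threshold=0.1)
```

A low `s_w` and a high `g_c` are the desirable outcome for a training algorithm.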


14.6.3 Simulation Results and Performance Comparison 


In these simulation tests, we have compared the performance of the presented algorithms with that
of the conventional SBP algorithm. For this purpose, all algorithms are used to train networks for the
same problem. We present two examples here: the 4-b parity checker (logic problem) and the
circle-in-the-square problem (analog problem). For all algorithms, learning parameters (such as μ, λ,
β, ...) are selected after many trials (100) to maximize the performance of each algorithm. However, an
exhaustive search for the best possible parameters is beyond the scope of this work, and better values
may exist for each algorithm. In order to make suitable comparisons, we kept the same neural network size
for testing all the training algorithms.


The problem of choosing the learning parameters
Like all optimization methods that are based on the steepest descent of the gradient, the convergence of
the algorithms is strongly related to


• The choice of the learning parameters such as μ, λ, β, ...
• The initial conditions, namely, the initial synaptic coefficients.


The learning parameters govern the amplitudes of the correction terms and consequently affect the
stability of the algorithm. To date, there is no practical guideline that allows the computation or even
the choice of these parameters in an optimal way. In the literature, we can only find some attempts
(even heuristic) to give formulae which contribute to speeding up the convergence or to stabilizing the
algorithm [44-48].

The same applies to the initial choice of the synaptic coefficients. They are generally chosen in an
arbitrary manner, and there are no laws for defining their values a priori.

Note that the initial synaptic coefficients fix the point from which the descent will start in the opposite
direction of the gradient. Consequently, the algorithm will converge in a reasonable time if this starting
point is near an acceptable local minimum or, preferably, a global minimum. Otherwise, the algorithm
will be trapped in the nearest local minimum or will not converge in a reasonable time.

In conclusion, like many other neural network parameters (such as the number of hidden layers, the
number of neurons per hidden layer, the nature of the activation function, etc.), adequate values of these
parameters can only be found by experiment assisted by expertise.


14.6.3.1 4-b Parity Checker 


The aim of this application is to determine the parity of a 4-bit binary number. The neural network 
inputs are logic values (0.9 for the higher level and −0.9 for the lower level). At each iteration, we present 
to the network the 16 input combinations with their desired outputs (0.1 for the lower level and 0.9 for 
the higher one). The network size is (4,8,2,1), i.e., 4 inputs, two hidden layers with 8 and 2 neurons, and 
one output neuron. The synaptic coefficients are initialized randomly in the range [-3, +3]. 
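
As an illustration, the 16 training patterns described above could be generated as follows. This is only a sketch: the function name `parity_patterns` is hypothetical, and the convention that an odd number of ones maps to the high output level is an assumption, since the text does not state which parity is "high".

```python
from itertools import product

HIGH_IN, LOW_IN = 0.9, -0.9    # input logic levels quoted in the text
HIGH_OUT, LOW_OUT = 0.9, 0.1   # desired output levels quoted in the text


def parity_patterns(n_bits=4):
    """Return the 2**n_bits (inputs, target) training pairs."""
    patterns = []
    for bits in product((0, 1), repeat=n_bits):
        inputs = [HIGH_IN if b else LOW_IN for b in bits]
        # Assumption: an odd number of ones maps to the high output level.
        target = HIGH_OUT if sum(bits) % 2 == 1 else LOW_OUT
        patterns.append((inputs, target))
    return patterns


patterns = parity_patterns()
for inputs, target in patterns[:4]:
    print(inputs, target)
```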

To evaluate the sensitivity to the initialization of the weights, S_w, we have chosen iter_number = 500 and 
ending_threshold = 10-°. 

For the generalization test, G_c, we have presented distorted 4-bit numbers to the network. The distortion 
rate with respect to the exact 4-bit binary numbers is about 30%, and gen_threshold = 0.1. 


Note that 

• We have followed the same procedure to determine the performances of each algorithm for all the 
  simulation examples. 
• Although these results were obtained after several experiments (in order to maximize the performance 
  of each algorithm), they may be slightly changed depending on the values of ending_threshold, 
  iter_number, gen_threshold, and even on the range of the initial synaptic coefficients and their 
  statistical distribution. 

Tables 14.1 and 14.2 summarize the performance of all the algorithms for this application. From these 
results, we note that the new MLMSF network is less affected by the choice of the initial weights and has 
a good generalization capability with respect to the SBP algorithm. 

TABLE 14.1 Comparison of the Performance of the Different Algorithms with respect to the SBP 
Algorithm for the 4-b Parity Checker 

Algorithm      S_w (%)        G_c (%) 
SBP            74             88 
QP             [illegible]    88 
DBD            76             89 
Super SAB      [illegible]    [illegible] 
STS            [illegible]    [illegible] 
CG             [illegible]    [illegible] 
RPROP          78             94 
ABP            75             90 
SASS           76             89 
LSBP           79             97 
[illegible]    [illegible]    [illegible] 

TABLE 14.2 Improvement Ratios with respect to the SBP Algorithm 

Algorithm      In Iteration   In Time 
QP             1.18           1.10 
DBD            [illegible]    1.20 
Super SAB      1.35           1.13 
STS            [illegible]    1.21 
CG             [illegible]    1.25 
RPROP          [illegible]    [illegible] 
ABP            1.19           1.11 
SASS           [illegible]    [illegible] 
LSBP           [illegible]    [illegible] 
[illegible]    [illegible]    [illegible] 

14.6.3.1.1 Conclusion 

From these results, one concludes that all the proposed algorithms have almost the same performance in 
the sensitivity to the initialization, the generalization capacity, and the time gain. The CG, RPROP, and 
the LSBP algorithms have a slight superiority with respect to the other algorithms. However, the speed 
of convergence remains less than expected. 

14.6.3.2 Circle-in-the-Square Problem 

In this application, the neural network has to decide whether a point with coordinates (x,y), varying 
from −0.5 to +0.5, is in the circle of radius equal to 0.35 [1]. Training patterns, which alternate between 
the two classes (inside and outside the circle), are presented to the network. In each iteration, we present 
100 input/output patterns to the network. The network size is (2,8,2,1) and the synaptic coefficients are 
initialized randomly in the range [-1,+1]. 

To evaluate S_w, we have chosen iter_number = 200 and ending_threshold = 10-*. For G_c, we have 
presented to the network new coordinates (x,y), and we have chosen gen_threshold = 0.1. 
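
A possible way of building the alternating training set for this problem is sketched below. The circle is assumed to be centered at the origin, the targets are assumed to use the same 0.9/0.1 levels as in the parity example, and the function names and the random seed are hypothetical; none of these details is specified in the text.

```python
import random

RADIUS = 0.35  # circle radius quoted in the text


def inside_circle(x, y, radius=RADIUS):
    """True if (x, y) lies in the circle (assumed centered at the origin)."""
    return x * x + y * y <= radius * radius


def alternating_patterns(n=100, seed=0):
    """Draw n points in [-0.5, 0.5]^2, alternating inside/outside the circle."""
    rng = random.Random(seed)
    patterns = []
    want_inside = True
    while len(patterns) < n:
        x, y = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
        if inside_circle(x, y) == want_inside:
            # Assumption: 0.9 codes "inside" and 0.1 codes "outside".
            patterns.append(((x, y), 0.9 if want_inside else 0.1))
            want_inside = not want_inside
    return patterns


patterns = alternating_patterns()
print(patterns[:2])
```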
Tables 14.3 and 14.4 summarize the performances of the algorithms for this application. 

TABLE 14.3 Comparison of the Performances of the Different Algorithms with respect to the SBP 
Algorithm for the Circle-in-the-Square Problem 

Algorithm      S_w (%)        G_c (%) 
SBP            [illegible]    [illegible] 
QP             [illegible]    [illegible] 
DBD            68             83 
Super SAB      68             85 
STS            [illegible]    [illegible] 
CG             [illegible]    [illegible] 
RPROP          [illegible]    [illegible] 
ABP            [illegible]    [illegible] 
SASS           [illegible]    [illegible] 

14.6.3.2.1 Conclusion 

For this analog problem, one notes that all the algorithms have almost the same performance. For this 
reason, several algorithms were proposed to increase the convergence of the learning algorithms for the 
multilayer perceptron networks. 


14.7 Backpropagation Algorithms with Different Optimization Criteria 

The choice of the training algorithm determines the rate of convergence, the time required to reach the 
solution, and the optimality of the latter. In the field of neural networks, training algorithms may differ 
from that of the SBP algorithm by the optimization criterion and/or by the method with which updating 
equations are derived. 

Different forms of optimization criterion have been proposed in order to increase the convergence 
speed and/or to improve the generalization capability, and two algorithms using more advanced 
optimization procedures have been published. 

In [14], Karayiannis et al. developed the following generalized criterion for training an MLP: 

$$E_p = (1-\lambda)\sum_{j=1}^{n_L}\phi_1(e_{jp}) + \lambda\sum_{j=1}^{n_L}\phi_2(e_{jp}) \tag{14.43}$$

where $\phi_1(e_{jp})$ and $\phi_2(e_{jp})$ are known as loss functions that must be convex and differentiable, and $\lambda \in [0,1]$. 
Inspired by this equation, we have proposed [1] a new learning algorithm that is remarkably faster 
than the SBP algorithm. It is based on the following criterion: 

$$E_p = \sum_{j=1}^{n_L}\frac{1}{2}e_{1jp}^2 + \sum_{j=1}^{n_L}\frac{1}{2}\lambda e_{2jp}^2 \tag{14.44}$$


where $e_{1jp}$ and $e_{2jp}$ are the nonlinear and the linear output errors, respectively. To work with a system of 
linear equations, the authors in [16-19] used an inversion of the output layer nonlinearity and an 
estimation of the desired output of the hidden layers. Then they applied the recursive least-squares (RLS) 
algorithm at each layer, yielding a fast training algorithm. To avoid the inversion of the output 
nonlinearity, the authors in [15] used the standard threshold logic-type nonlinearity as an approximation 
of the sigmoid. These approaches yield fast training with respect to the SBP algorithm but are still 
approximation dependent. 

At the same time, a real-time learning algorithm based on the EKF technique was developed [32-36]. In 
these works, a Kalman filter is assigned to each connected weight. Parameter-free tuning is the major 
advantage. In the following, we provide a summary of some of these algorithms. 

TABLE 14.4 Improvement Ratios with respect to the SBP Algorithm 

Algorithm      In Iteration   In Time 
QP             1.10           1.03 
DBD            1.13           1.10 
Super SAB      1.12           1.06 
STS            1.12           1.12 
CG             1.18           1.20 
RPROP          1.20           1.20 
ABP            1.13           1.05 
SASS           1.14           1.04 
LSBP           [illegible]    1.17 
[illegible]    [illegible]    [illegible] 
14.7.1 Modified Backpropagation Algorithm 


Based on Equation 14.43, we have developed [1] a new error function using the linear and nonlinear 
neuronal outputs. 
Recall that the output error (called the nonlinear output error signal) is now given by 

$$e_{1j}^{[s]} = d_j^{[s]} - y_j^{[s]} \tag{14.45}$$

Now, the linear output error signal can easily be found by 

$$e_{2j}^{[s]} = ld_j^{[s]} - u_j^{[s]} \tag{14.46}$$

where $ld_j^{[s]}$ is given by 

$$ld_j^{[s]} = f^{-1}(d_j^{[s]}) \tag{14.47}$$

This is illustrated in Figure 14.4. 
The proposed optimization criterion (14.44) is given by 

$$E_p = \sum_{j=1}^{n_L}\frac{1}{2}e_{1jp}^2 + \sum_{j=1}^{n_L}\frac{1}{2}\lambda e_{2jp}^2$$

Applying the gradient descent method to $E_p$, we obtain the following updating equations for the 
output layer [L] and the hidden layers [s] from (L−1) to 1, respectively: 

$$\Delta w_{ji}^{[L]}(k) = \mu f'(u_j^{[L]})\,e_{1j}^{[L]}\,y_i^{[L-1]} + \mu\lambda\,e_{2j}^{[L]}\,y_i^{[L-1]} \tag{14.48}$$

$$\Delta w_{ji}^{[s]} = \mu f'(u_j^{[s]})\,e_{1j}^{[s]}\,y_i^{[s-1]} + \mu\lambda\,e_{2j}^{[s]}\,y_i^{[s-1]} \tag{14.49}$$

where 

$$e_{1j}^{[s]} = \sum_{r=1}^{n_{s+1}} e_{1r}^{[s+1]}\,f'(u_r^{[s+1]})\,w_{rj}^{[s+1]} \tag{14.50}$$

and 

$$e_{2j}^{[s]} = f'(u_j^{[s]})\sum_{r=1}^{n_{s+1}} e_{2r}^{[s+1]}\,w_{rj}^{[s+1]} \tag{14.51}$$

are assumed to be the estimates of the nonlinear and linear error signals for the hidden layers, respectively. 

FIGURE 14.4 Finding the nonlinear and the linear error signal in a neuron. 
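
The output-layer update can be sketched for a single neuron as follows. This is an illustrative reading of the update rule, not the authors' implementation: it assumes a logistic sigmoid activation (so that the inverse is the logit), a desired output strictly between 0 and 1, and arbitrary values for the learning rate `mu` and the weighting `lam`; the function names are hypothetical.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def inv_sigmoid(d):
    """Inverse of the logistic sigmoid, used to get the linear desired output."""
    return math.log(d / (1.0 - d))


def mbp_output_update(w, y_prev, d, mu=0.1, lam=0.1):
    """One modified-backpropagation step for a single output neuron.

    w: weights, y_prev: outputs of the previous layer (bias input included),
    d: desired output in (0, 1).
    """
    u = sum(wi * yi for wi, yi in zip(w, y_prev))
    y = sigmoid(u)
    e1 = d - y                  # nonlinear output error
    e2 = inv_sigmoid(d) - u     # linear output error
    fprime = y * (1.0 - y)      # derivative of the logistic sigmoid
    return [wi + mu * fprime * e1 * yi + mu * lam * e2 * yi
            for wi, yi in zip(w, y_prev)]


w = [0.0, 0.0, 0.0]
y_prev = [1.0, 0.9, -0.9]       # bias input followed by two inputs
for _ in range(500):
    w = mbp_output_update(w, y_prev, d=0.9)
y_final = sigmoid(sum(wi * yi for wi, yi in zip(w, y_prev)))
print(y_final)
```

The linear error term keeps a non-vanishing gradient even where the sigmoid saturates, which is the intuition behind adding it to the criterion.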


14.7.2 Least Squares Algorithms for Neural Network Training 


The existence of the nonlinearity in the activation function makes the backpropagation algorithm a 
nonlinear one. If this nonlinearity can be avoided in one way or another, one can make use of all least- 
squares adaptive filtering techniques for solving this problem. These techniques are known to have rapid 
convergence properties. 

The development of a training algorithm using RLS methods for NNs was first introduced in [16] and 
then later extended in [15,17-19]. 

For the purpose of developing a system of linear equations, two approaches can be used. The first is 
to invert the sigmoidal output node function, as in the case of Figure 14.4 [19]. The second is the use of 
a standard threshold logic-type nonlinearity, as shown in Figure 14.5 [15]. 


14.7.2.1 Linearization by Nonlinearity Inversion 


Scalero et al. in [19] have proposed a new algorithm, which modifies weights based on the minimization 
of the MSE between the linear desired output and the actual linear output of a neuron. 
It is shown in Figure 14.4 that a neuron is formed by a linear part (a scalar product) and a nonlinear part, 
so it is possible to separate the linear and nonlinear parts to derive a linear optimization problem. 
The optimization problem is then based on the following optimization criterion: 

$$E(k) = \sum_{t=1}^{k}\rho(k,t)\sum_{j=1}^{n_L}\left(e_{2j}^{[L]}(t)\right)^2 \tag{14.52}$$

where $\rho(k,t)$ is the variable weighting sequence which satisfies 

$$\rho(k,t) = \lambda(k)\,\rho(k-1,t) \tag{14.53}$$

and $\rho(k,k) = 1$. 
This means that we may write 

$$\rho(k,t) = \prod_{j=t+1}^{k}\lambda(j) \tag{14.54}$$

FIGURE 14.5 Standard logic-type nonlinearity. 

In most applications, a constant exponential weighting is used, i.e., $\rho(k,t) = \lambda^{k-t}$, where $\lambda$ is a 
positive number, less than, but close to 1, which is called the “forgetting factor.” 

This error (14.52) can be minimized by taking partial derivatives of $E(k)$ with respect to each weight 
and equating them to zero. The result will be a set of $n_{s-1} + 1$ linear equations, where $n_{s-1} + 1$ is the number 
of weights in the neuron. 

Minimizing $E(k)$ with respect to a weight vector $w_j^{[s]}$ produces 

$$w_j^{[s]}(k) = \mathbf{R}^{-1}(k)\,\mathbf{p}(k) \tag{14.55}$$

where 

$$\mathbf{R}(k) = \sum_{p=1}^{k}\lambda^{k-p}\,y_p^{[s-1]}\left(y_p^{[s-1]}\right)^T \tag{14.56}$$

and 

$$\mathbf{p}(k) = \sum_{p=1}^{k}\lambda^{k-p}\,ld_{jp}^{[s]}\,y_p^{[s-1]} \tag{14.57}$$

Except for a factor of 1/k, (14.56) and (14.57) are estimates of the correlation matrix and correlation 
vector, respectively, and they improve with increasing k. 
Both (14.56) and (14.57) can be written in a recursive form as 

$$\mathbf{R}(k) = \lambda\,\mathbf{R}(k-1) + y_k^{[s-1]}\left(y_k^{[s-1]}\right)^T \tag{14.58}$$

and 

$$\mathbf{p}(k) = \lambda\,\mathbf{p}(k-1) + ld_{jk}^{[s]}\,y_k^{[s-1]} \tag{14.59}$$

Although (14.58) and (14.59) are in a recursive form, what one needs is the recursive equation for 
the inverse autocorrelation matrix $\mathbf{R}^{-1}(k)$, as required by (14.55). This can be achieved by using 
either the matrix inversion lemma [37] or what may be viewed as its compact form, the Kalman 
filter [32-36]. 
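
The recursion on the inverse matrix can be sketched with the standard exponentially weighted RLS equations obtained from the matrix inversion lemma. The code below is a generic textbook form for one linear neuron, not the exact algorithm of [16-19]; the names `rls_step`, `P` (the running estimate of the inverse correlation matrix), the initial value of `P`, and the synthetic data are assumptions made for illustration.

```python
import random


def rls_step(w, P, y, ld, lam=0.99):
    """One recursive least-squares step for a linear neuron.

    w: weight vector, P: current estimate of the inverse correlation matrix,
    y: input vector, ld: linear desired output, lam: forgetting factor.
    """
    n = len(w)
    Py = [sum(P[i][j] * y[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(y[i] * Py[i] for i in range(n))
    gain = [v / denom for v in Py]
    err = ld - sum(w[i] * y[i] for i in range(n))   # a priori linear error
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # Matrix inversion lemma: P <- (P - gain * y^T P) / lam.
    # P is symmetric, so the row vector y^T P equals Py transposed.
    P_new = [[(P[i][j] - gain[i] * Py[j]) / lam for j in range(n)]
             for i in range(n)]
    return w_new, P_new


# Synthetic check: recover the weights of a known linear neuron.
rng = random.Random(1)
true_w = [0.5, -1.0, 2.0]
w = [0.0, 0.0, 0.0]
P = [[1000.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for _ in range(100):
    y = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]   # bias + two inputs
    ld = sum(a * b for a, b in zip(true_w, y))
    w, P = rls_step(w, P, y, ld)
print(w)
```

No matrix is inverted explicitly: each step costs a few matrix-vector products, which is what makes the approach attractive for training layer by layer.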


14.7.2.2 Linearization by Using Standard Threshold Logic-Type Nonlinearity 


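In this approach, the standard threshold logic-type nonlinearity shown in Figure 14.5 is used. 
The neuronal outputs will be expressed by the following system: 

$$y_j^{[s]} = f(u_j^{[s]}) = \begin{cases} 0 & \text{if } u_j^{[s]} \le 0 \\ \dfrac{1}{a_s}\,u_j^{[s]} & \text{if } 0 < u_j^{[s]} < a_s \\ 1 & \text{if } u_j^{[s]} \ge a_s \end{cases} \tag{14.60}$$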
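
As a small illustration, the threshold logic-type nonlinearity (zero below 0, a ramp of slope 1/a up to a, and one above a) and its derivative can be written as below. The function names and the default ramp width `a = 1.0` are hypothetical.

```python
def threshold_logic(u, a=1.0):
    """Ramp nonlinearity: 0 for u <= 0, u/a on the ramp, 1 for u >= a."""
    if u <= 0.0:
        return 0.0
    if u >= a:
        return 1.0
    return u / a


def threshold_logic_slope(u, a=1.0):
    """Derivative of the ramp: 1/a inside the ramp region, 0 elsewhere."""
    return 1.0 / a if 0.0 < u < a else 0.0


print(threshold_logic(0.5, a=2.0), threshold_logic_slope(0.5, a=2.0))
```

Because the slope is zero outside the ramp region, a neuron whose total input falls there contributes nothing to a gradient-based or least-squares update.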
The optimization criterion at the iteration k which incorporates a limited memory is given by 

$$E(k) = \frac{1}{2}\sum_{t=1}^{k}\rho(k,t)\sum_{j=1}^{n_L}\left(e_j^{[L]}(t)\right)^2 \tag{14.61}$$

where $e_j^{[L]}(t) = d_j^{[L]}(t) - y_j^{[L]}(t)$. 

For the weight vectors, $w_j^{[L]}(k)$, in the final (output) layer, since the desired outputs are specified, $E(k)$ 
can be minimized for $w_j^{[L]}(k)$ by taking its partial derivatives with respect to $w_j^{[L]}(k)$ and setting them 
equal to zero, thus 

$$\frac{\partial E(k)}{\partial w_j^{[L]}(k)} = 0 \tag{14.62}$$

This leads to the following deterministic normal equation: 

$$R_L(k)\,w_j^{[L]}(k) = P_L(k) \tag{14.63}$$

where 

$$R_L(k) = \sum_{t=1}^{k}\rho(k,t)\,\frac{1}{a_L^2}\,y^{[L-1]}(t)\left(y^{[L-1]}(t)\right)^T \tag{14.64}$$

and 

$$P_L(k) = \sum_{t=1}^{k}\rho(k,t)\,\frac{1}{a_L}\,d_j^{[L]}(t)\,y^{[L-1]}(t) \tag{14.65}$$

This equation can be solved efficiently using the weighted RLS algorithm [37,38]. 

Note that when the total input to node j does not lie on the ramp region of the threshold logic non- 
linearity function, the derivative in (14.62) is always zero. This implies that the normal equation in 
(14.63) will only be solved when the input to the relevant nodes lies within the ramp region; otherwise 
no updating is required. This is the case for all the other layers. 

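Similar normal equations for the other layers can be obtained by taking the partial derivatives of $E(k)$ 
with respect to the weight vectors in these layers and setting the results equal to zero. 

For the weight vector in layer [s], we have 

$$\frac{\partial E(k)}{\partial w_j^{[s]}(k)} = 0 \tag{14.66}$$

which leads, using the chain rule, to 

$$\sum_{t=1}^{k}\rho(k,t)\sum_{i=0}^{n_s}\frac{\partial y_i^{[s]}(t)}{\partial w_j^{[s]}(k)}\,e_i^{[s]}(t) = 0 \tag{14.67}$$

where 

$$e_i^{[s]}(t) = \frac{1}{a_{s+1}}\sum_{j=1}^{n_{s+1}} w_{ji}^{[s+1]}(k)\,e_j^{[s+1]}(t) \tag{14.68}$$

By defining the hidden desired output of the jth neuron of layer [s] as 

$$d_j^{[s]}(t) = y_j^{[s]}(t) + e_j^{[s]}(t) \tag{14.69}$$

the following normal equation can be obtained for the layer [s]: 

$$\sum_{t=1}^{k}\rho(k,t)\,\frac{1}{a_s}\,d_j^{[s]}(t)\,y^{[s-1]}(t) = \left[\sum_{t=1}^{k}\rho(k,t)\,\frac{1}{a_s^2}\,y^{[s-1]}(t)\left(y^{[s-1]}(t)\right)^T\right] w_j^{[s]}(k) \tag{14.70}$$

The equations for updating the weight vectors $w_j^{[s]}(k)$ can be derived using the matrix inversion 
lemma as 

$$K^{[s]}(k) = \frac{R^{-1[s]}(k-1)\,Y^{[s-1]}(k)}{\lambda + Y^{[s-1]T}(k)\,R^{-1[s]}(k-1)\,Y^{[s-1]}(k)} \tag{14.71}$$

$$R^{-1[s]}(k) = \lambda^{-1}\left[I - K^{[s]}(k)\,Y^{[s-1]T}(k)\right]R^{-1[s]}(k-1) \quad \forall s \in [1, L] \tag{14.72}$$

$$\Delta w_j^{[s]}(k) = K^{[s]}(k)\left(d_j^{[s]}(k) - \frac{1}{a_s}\,Y^{[s-1]T}(k)\,w_j^{[s]}(k-1)\right) \quad \forall j \in [1, n_s] \tag{14.73}$$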

14.8 Kalman Filters for MLP Training 


The learning algorithm of an MLP can be regarded as parameter estimation for a nonlinear system. 
Many estimation methods for general nonlinear systems have been reported so far. For linear dynamic 
systems with white input and observation noise, the Kalman algorithm [49] is known to be an optimum 
algorithm. 

In [35], the classical Kalman filter method has been proposed to train MLPs, and better results have 
been shown compared with the SBP algorithm. In order to work with a system of linear equations, an 
inversion of the output layer nonlinearity and an estimation of the desired output summation in the 
hidden layers are used as in Section 14.7.2. 

Extended versions of the Kalman filter algorithm can be applied to nonlinear dynamic systems by 
linearizing the system around the current estimate of the parameters. Although it is computationally 
complex, this algorithm updates parameters consistent with all previously seen data and usually con- 
verges in a few iterations. 


14.8.1 Multidimensional Kalman Filter Algorithm (FKF) 


To solve the nonlinear network problem, one should partition it into linear and nonlinear parts [19]. Since 
the fast Kalman filter is only applied to solve linear filtering problems, the same linearization approach 
as in [19] is considered. 

Looking at Figure 14.1, the linear part of a neuron may be viewed as a multidimensional input 
linear filter. 

Therefore, the MLP training problem is transformed into a filtering problem by using the fast Kalman 
filter [37]. 

The algorithm works as follows: 


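• FKF procedure 
For each layer s from 1 to L, compute the following quantities: 

• Forward prediction error 

$$e_s^{a}(k) = y_{s-1}(k) - A_s^T(k-1)\,y_{s-1}(k-1) \tag{14.74}$$

• Forward prediction matrix 

$$A_s(k) = A_s(k-1) + G_s(k-1)\left(e_s^{a}(k)\right)^T \tag{14.75}$$

• A posteriori forward prediction error 

$$\varepsilon_s^{a}(k) = y_{s-1}(k) - A_s^T(k)\,y_{s-1}(k-1) \tag{14.76}$$

• Energy matrix 

$$E_s^{a}(k) = \lambda E_s^{a}(k-1) + e_s^{a}(k)\left(\varepsilon_s^{a}(k)\right)^T \tag{14.77}$$

• Augmented Kalman gain 

$$G_s^{(1)}(k) = \begin{bmatrix} 0 \\ G_s(k-1) \end{bmatrix} + \begin{bmatrix} I \\ -A_s(k) \end{bmatrix}\left(E_s^{a}(k)\right)^{-1}\varepsilon_s^{a}(k) = \begin{bmatrix} M_s(k) \\ m_s(k) \end{bmatrix} \tag{14.78}$$

• Backward prediction error 

$$e_s^{b}(k) = y_{s-1}(k-1) - B_s^T(k-1)\,y_{s-1}(k) \tag{14.79}$$

• Kalman gain 

$$G_s(k) = \frac{1}{1 - \left(e_s^{b}(k)\right)^T m_s(k)}\left(M_s(k) + B_s(k-1)\,m_s(k)\right) \tag{14.80}$$

• Backward prediction matrix 

$$B_s(k) = B_s(k-1) + G_s(k)\left(e_s^{b}(k)\right)^T \tag{14.81}$$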
It follows that the updating rules for the different synaptic weights are given as follows: 


• For the weight of each neuron j in the output layer L 

$$w_j^{[L]}(k) = w_j^{[L]}(k-1) + G_L(k)\left(d_j(k) - y_j^{[L]}(k)\right) \tag{14.82}$$

• For each hidden layer s = 1 through L − 1, the weight vectors are updated by 

$$w_j^{[s]}(k) = w_j^{[s]}(k-1) + G_s(k)\,e_j^{[s]}\,\mu \tag{14.83}$$

where $e_j^{[s]}$ is given by (14.50). 


It should be noted that to avoid the inversion of the energy matrix $E_s^{a}(k)$ at each layer, we can use a 
recursive form of the matrix inversion lemma [50]. 
For the different steps of the algorithm, see Appendix 14.A. 


14.8.2 Extended Kalman Filter Algorithm 


The extended Kalman filter is well known as a state estimation method for a nonlinear system, and can 
be used as a parameter estimation method by augmenting the state with unknown parameters. Since 
the EKF-based algorithm approximately gives the minimum variance estimate of the link weights, it is 
expected that it converges in fewer iterations than the SBP algorithm. 

Since mathematical derivations for the EKF are widely available in the literature [51,52], we shall 
briefly outline the EKF applied to a discrete time system. 

Consider a nonlinear finite dimensional discrete time system of the form 

$$\begin{aligned} x(k+1) &= f_k(x(k)) + w(k) \\ y(k) &= h_k(x(k)) + v(k) \end{aligned} \tag{14.84}$$

where 
x(k) is the state vector 
y(k) is the observation vector 
$f_k$ and $h_k$ are time-variant nonlinear functions 


Also, w(k) and v(k) are assumed to be zero mean, independent, Gaussian white noise vectors with known 
covariance Q(k) and R(k), respectively. 

The initial state x(0) is assumed to be a Gaussian random vector with mean $x_0$ and covariance $P_0$. 

Defining the current estimated state vector based on the observations up to time k − 1 as $\hat{x}(k/k-1)$, 
the EKF updates the state vector as each new pattern is available. 

The final results are given by the following equations: 

$$\hat{x}(k+1/k) = f_k\left(\hat{x}(k/k)\right) \tag{14.85}$$

$$\hat{x}(k/k) = \hat{x}(k/k-1) + K(k)\left[y(k) - h_k\left(\hat{x}(k/k-1)\right)\right] \tag{14.86}$$

$$K(k) = P(k/k-1)\,H^T(k)\left[H(k)\,P(k/k-1)\,H^T(k) + R(k)\right]^{-1} \tag{14.87}$$

$$P(k+1/k) = F(k)\,P(k/k)\,F^T(k) + Q(k) \tag{14.88}$$

$$P(k/k) = P(k/k-1) - K(k)\,H(k)\,P(k/k-1) \tag{14.89}$$

This algorithm is initialized by $\hat{x}(0/-1) = x_0$ and $P(0/-1) = P_0$. The matrix K(k) is the Kalman gain. 
F(k) and H(k) are defined by 

$$F(k) = \left(\frac{\partial f_k}{\partial x}\right)_{x=\hat{x}(k/k)} \tag{14.90}$$

$$H(k) = \left(\frac{\partial h_k}{\partial x}\right)_{x=\hat{x}(k/k-1)} \tag{14.91}$$


The standard Kalman filter for the linear system, in which $f_k(x(k)) = A_k x(k)$ and $h_k(x(k)) = B_k x(k)$, gives 
the minimum variance estimate of x(k). In other words, $\hat{x}(k/k-1)$ is optimal in the sense that the trace 
of the error covariance defined by 

$$P(k/k-1) = E\left[\left(x(k) - \hat{x}(k/k-1)\right)\left(x(k) - \hat{x}(k/k-1)\right)^T\right] \tag{14.92}$$

is minimized, where E(.) denotes here the expectation operator. On the other hand, the EKF for the 
nonlinear system is no longer optimal, and $\hat{x}(k/k-1)$ and $P(k/k-1)$ express approximate conditional 
mean and covariance, respectively. Because the EKF is based on the linearization of $f_k(x(k))$ 
and $h_k(x(k))$ around $\hat{x}(k/k)$ and $\hat{x}(k/k-1)$, respectively, and on the use of the standard Kalman filter, 
it is also noted that the EKF may get stuck at a local minimum if the initial estimates are not appropriate 
[54]. Nevertheless, many successful applications have been reported because of its excellent 
convergence properties. 

We will show now how a real-time learning algorithm for the MLP can be derived from the EKF 
[32,53]. Since the EKF is a method of estimating the state vector, we shall put the unknown link weights 
as the state vector 

$$M = \left[(w^1)^T, (w^2)^T, \ldots, (w^L)^T\right]^T \tag{14.93}$$

The MLP is then expressed by the following nonlinear system equations: 

$$\begin{aligned} M(k+1) &= M(k) \\ d(k) &= h_k(M(k)) + v(k) \\ &= y^{[L]}(k) + v(k) \end{aligned} \tag{14.94}$$

The input to the MLP for a pattern k combined with the structure of the MLP is expressed by a nonlinear 
time-variant function $h_k$. The observation vector is expressed by the desired output vector d(k), and v(k) 
is assumed to be a white noise vector with covariance matrix R(k) regarded as a modeling error. 
The application of the EKF to this system gives the following real-time learning algorithm: 

$$\hat{M}(k) = \hat{M}(k-1) + K(k)\left[d(k) - \hat{y}^{[L]}(k)\right] \tag{14.95}$$

$$K(k) = P(k-1)\,H^T(k)\left[H(k)\,P(k-1)\,H^T(k) + R(k)\right]^{-1} \tag{14.96}$$

$$P(k) = P(k-1) - K(k)\,H(k)\,P(k-1) \tag{14.97}$$

We note that we put here $P(k) = P(k/k)$ and $\hat{M}(k) = \hat{M}(k/k)$, since $P(k+1/k) = P(k/k)$ and $\hat{M}(k+1/k) = \hat{M}(k/k)$. 
Also, $\hat{y}^{[L]}(k)$ denotes the estimate of $y^{[L]}(k)$ based on the observations up to time k − 1, which is 
computed by $\hat{y}^{[L]}(k) = h_k(\hat{M}(k-1))$. According to (14.91), H(k) is expressed by 

$$H(k) = \left(\frac{\partial \hat{y}^{[L]}(k)}{\partial M}\right)_{M=\hat{M}(k-1)} \tag{14.98}$$

$$H(k) = \left[H_1^1(k), \ldots, H_{n_1}^1(k), H_1^2(k), \ldots, H_{n_2}^2(k), \ldots, H_1^L(k), \ldots, H_{n_L}^L(k)\right] \tag{14.99}$$

with the definition of 

$$H_j^s(k) = \frac{\partial \hat{y}^{[L]}(k)}{\partial w_j^s} \tag{14.100}$$


The different steps of the algorithm are given in Appendix 14.B. 
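
To make the gain, weight, and covariance updates concrete, the sketch below applies one EKF step to the simplest possible case: a single sigmoid neuron with a scalar output, so that H(k) reduces to a row vector. This is an illustrative reduction, not the layer-wise algorithm of Appendix 14.B; the function name `ekf_step`, the noise variance `R = 0.1`, and the initial covariance are assumptions.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def ekf_step(M, P, x, d, R=0.1):
    """One EKF update of the weight vector M of a single sigmoid neuron.

    M: weight estimate, P: error covariance, x: input vector (bias included),
    d: desired output, R: assumed observation-noise variance.
    """
    n = len(M)
    y = sigmoid(sum(M[i] * x[i] for i in range(n)))
    H = [y * (1.0 - y) * x[i] for i in range(n)]     # dy/dM at the current estimate
    PH = [sum(P[i][j] * H[j] for j in range(n)) for i in range(n)]   # P H^T
    s = sum(H[i] * PH[i] for i in range(n)) + R      # H P H^T + R (a scalar here)
    K = [v / s for v in PH]                          # Kalman gain
    M_new = [M[i] + K[i] * (d - y) for i in range(n)]
    # P - K H P; P is symmetric, so the row vector H P equals PH transposed.
    P_new = [[P[i][j] - K[i] * PH[j] for j in range(n)] for i in range(n)]
    return M_new, P_new


M0 = [0.0, 0.0, 0.0]
P0 = [[100.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
x = [1.0, 0.9, -0.9]
M1, P1 = ekf_step(M0, P0, x, d=0.9)
y1 = sigmoid(sum(m * xi for m, xi in zip(M1, x)))
print(M1, y1)
```

A single step already moves the neuron output most of the way from 0.5 toward the target, while the trace of P shrinks, reflecting the reduced uncertainty on the weights.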


14.9 Davidon-Fletcher-Powell Algorithms 


In [41,42], the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS) was 
applied to train the MLP, and it was found that this algorithm converges much faster than the SBP 
algorithm; the potential drawback of the BFGS method lies in the large amount of memory needed to 
store the Hessian matrix. In [7], the Marquardt-Levenberg algorithm was applied to train the feedforward 
neural network, and simulation results on some problems showed that the algorithm is much faster 
than the conjugate gradient algorithm and the variable learning rate algorithm. The great drawback 
of this algorithm is its high computational complexity and its sensitivity to the initial choice of the 
parameter μ in the update of the Hessian matrix. 

In this part, we present a new fast training algorithm based on the Davidon-Fletcher-Powell (DFP) 
method. The DFP algorithm consists of approximating the Hessian matrix (as is the case for all 
quasi-Newton methods). 

The new Hessian matrix is also approximated by using only the gradient vector $\nabla E(W)$, as in the ML 
method, but the approximation in this case uses more information, provided by the descent direction $d_r$ 
and the step length λ obtained by minimizing the cost function $E(W + \lambda d_r)$, which governs the amplitude 
of descent and updating. 

Let H(k) be the inverse of the Hessian matrix. The DFP algorithm is based on updating H(k) iteratively by 

$$H(k+1) = H(k) + \frac{\delta(k)\,\delta^T(k)}{\delta^T(k)\,\gamma(k)} - \frac{H(k)\,\gamma(k)\,\gamma^T(k)\,H(k)}{\gamma^T(k)\,H(k)\,\gamma(k)} \tag{14.101}$$

where $\delta(k)$ and $\gamma(k)$ are the different parameters used in the DFP algorithm. 
The DFP algorithm for mathematical programming works as indicated in Appendix 14.C. 
14.9.1 Davidon-Fletcher-Powell Algorithm for Training MLP 


In this section, we apply the DFP algorithm first to train a single output layer perceptron, and then we 
extend all equations to MLP. 


14.9.1.1 DFP Algorithm for Training a Single Output Layer Perceptron 


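First, let us develop the new algorithm for a neuron j in the output layer [L] of the network. The 
optimization criterion $E(w_j^{[L]})$ for the current pattern is defined as 

$$E(w_j^{[L]}) = \frac{1}{2}\left(e_j^{[L]}\right)^2 \tag{14.102}$$

where $e_j^{[L]}$ is the neuron output error defined by Equation 14.4. Note that the difficult task in the 
application of the DFP algorithm is how to find the optimal value of λ that minimizes $E(w_j^{[L]} + \lambda d_{rj}^{[L]})$. 

The first method to search for the λ that minimizes $E(w_j^{[L]} + \lambda d_{rj}^{[L]})$ is to solve the following equation: 

$$\frac{\partial E(w_j^{[L]} + \lambda d_{rj}^{[L]})}{\partial\lambda} = 0 \tag{14.103}$$

Recall that 

$$d_{rj}^{[L]}(k) = -H_j^{[L]}(k)\,\nabla E(w_j^{[L]}(k)) \tag{14.104}$$

For this purpose, we have to differentiate E with respect to λ: 

$$\frac{\partial E(w_j^{[L]} + \lambda d_{rj}^{[L]})}{\partial\lambda} = \frac{1}{2}\frac{\partial e_j^2(w_j^{[L]} + \lambda d_{rj}^{[L]})}{\partial\lambda} = f'\!\left((w_j^{[L]} + \lambda d_{rj}^{[L]})^T y^{[L-1]}\right)\left(f\!\left((w_j^{[L]} + \lambda d_{rj}^{[L]})^T y^{[L-1]}\right) - d_j^{[L]}\right)\left(y^{[L-1]}\right)^T d_{rj}^{[L]}(k) \tag{14.105}$$

$E(w_j^{[L]} + \lambda d_{rj}^{[L]})$ is minimum when $\partial E(w_j^{[L]} + \lambda d_{rj}^{[L]})/\partial\lambda = 0$, which leads to the optimal value of λ: 

$$\lambda^* = \frac{-\log\!\left(\dfrac{1-d_j^{[L]}}{d_j^{[L]}}\right) - \left(y^{[L-1]}\right)^T w_j^{[L]}(k)}{\left(y^{[L-1]}\right)^T d_{rj}^{[L]}(k)} \tag{14.106}$$

We can apply this search method of λ* in the case of a single-layer perceptron, but this becomes a very 
difficult task when we move on to training an MLP. 

To avoid this hard computation to find an optimal step length λ*(k) for a given descent direction $d_r(k)$, 
we can use Wolfe’s line search [41,42]. 

This line search procedure satisfies the Wolfe line search conditions: 

$$E(w(k) + \lambda(k)\,d_r(k)) - E(w(k)) \le 10^{-4}\,\lambda(k)\,d_r^T(k)\,\nabla E(w(k)) \tag{14.107}$$

$$d_r^T(k)\,\nabla E(w(k) + \lambda(k)\,d_r(k)) \ge 0.9\,d_r^T(k)\,\nabla E(w(k)) \tag{14.108}$$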

In Appendixes 14.D and 14.E, we give the different steps of Wolfe’s line search method. 
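
The two Wolfe conditions (sufficient decrease with constant 10^-4 and curvature with constant 0.9) can be checked for a candidate step length as sketched below. The function name `satisfies_wolfe` and the quadratic test cost are hypothetical and serve only to exercise the two inequalities.

```python
def satisfies_wolfe(E, grad_E, w, d, lam, c1=1e-4, c2=0.9):
    """Return True if step length lam along direction d meets both Wolfe conditions."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w_new = [wi + lam * di for wi, di in zip(w, d)]
    slope0 = dot(d, grad_E(w))   # d^T grad E(w), negative for a descent direction
    sufficient_decrease = E(w_new) - E(w) <= c1 * lam * slope0
    curvature = dot(d, grad_E(w_new)) >= c2 * slope0
    return sufficient_decrease and curvature


# Hypothetical quadratic cost E(w) = w1^2 + 4 w2^2, steepest-descent direction.
def E(w):
    return w[0] ** 2 + 4.0 * w[1] ** 2


def grad_E(w):
    return [2.0 * w[0], 8.0 * w[1]]


w = [1.0, 1.0]
d = [-g for g in grad_E(w)]
for lam in (1e-6, 0.1, 1.0):
    print(lam, satisfies_wolfe(E, grad_E, w, d, lam))
```

A very small step fails the curvature condition (too little progress), while a very large step fails the sufficient-decrease condition; only intermediate step lengths are accepted.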
Before extending this method to train an MLP, let us give a practical example for a single neuron 
trained with ML, and the two versions of the DFP algorithm. 


14.9.1.2 DFP Algorithm for Training an MLP 


In the previous section, we have applied the new algorithm to a single-layer perceptron with two 
techniques to find a step length λ. 

To train an MLP, considered with a single hidden layer, we need to minimize the error function defined by 

$$E(w) = \frac{1}{2}\sum_{j=1}^{n_2}\left[d_j - f\!\left(\sum_{r=0}^{n_1} w_{jr}^{[2]}\,f\!\left((w_r^{[1]})^T y^{[0]}\right)\right)\right]^2 \tag{14.109}$$

The problem here is how to determine the weights of the network that minimize the error function E(w). 
This can be considered as an unconstrained optimization problem [43]. For this reason, we will extend 
the equations of the DFP method to the entire network to optimize the cost function presented in (14.109). 

The learning procedure with the DFP algorithm needs to find an optimal value λ*, which minimizes the 
function $E(w + \lambda d_r)$. To evaluate the coefficient λ, we will use the Wolfe line search because the exact 
method is very difficult, since we must differentiate $E(w + \lambda d_r)$ with respect to λ. 

When we applied the DFP algorithm to train a neural network with a single hidden layer, we found that 
this algorithm is faster than the ML one. Different steps of the algorithm are given in Appendix 14.E. 

TABLE 14.5 Performance Comparison of the Fast Algorithms with respect to the SBP One for the 4-b 
Parity Checker 

Algorithm      S_w (%)        G_c (%) 
MBP            [illegible]    97 
RLS            36             [illegible] 
FKF            [illegible]    [illegible] 
EKF            52             98 
DFP            [illegible]    [illegible] 

TABLE 14.6 Improvement Ratios with respect to SBP 

Algorithm      In Iteration   In Time 
MBP            2.8            2.6 
RLS            3.4            2.7 
FKF            3.3            2.1 
EKF            3.4            2.3 
DFP            3.5            2.6 


14.10 Some Simulation Results 


In this section, we present some simulation results of the last five algorithms. Some of them are based 
on the Newton method. To perform a good comparison, we have used the same problems as in Section 
14.6: the 4-b parity checker and the circle-in-the-square problem. The comparison is based on S_w, G_c, 
and the improvement ratios, always with respect to the performance of the SBP algorithm. 


14.10.1 For the 4-b Parity Checker 


Tables 14.5 and 14.6 present the different simulation results for this problem. 


14.10.2 For the Circle-in-the-Square Problem 


Tables 14.7 and 14.8 present the different simulation results for this problem. 
From these results, one can conclude that the performance of these fast algorithms is quite similar; the 
choice of the best algorithm for training an MLP still depends on the problem, the MLP structure, and 
the expertise of the user. 

TABLE 14.7 Performance Comparison of the Fast Algorithms with respect to the SBP One for the 
Circle-in-the-Square Problem 

Algorithm      S_w (%)        G_c (%) 
MBP            55             99 
RLS            55             98 
FKF            52             99 
EKF            52             99 
DFP            50             99 

TABLE 14.8 Improvement Ratios with respect to SBP 

Algorithm      In Iteration   In Time 
MBP            [illegible]    [illegible] 
RLS            [illegible]    [illegible] 
FKF            [illegible]    [illegible] 
EKF            [illegible]    [illegible] 
DFP            3.4            2.5 
14.11 Conclusion 


This work constitutes a brief survey related to the different techniques and algorithms that have been 
proposed in the literature to speed up the learning phase in an MLP. We have shown that the learning 
speed depends on several factors; some are related to the network and its use and others are related to 
the algorithm itself. 

Despite the variety of the proposed methods, one cannot state definitively which algorithm is best 
suited to a given application. However, the algorithms based on the SOOM are more rapid than those 
based on the gradient method. The recently developed NBN algorithm [39], described in Chapter 13, 
is very fast, and its success rate is 100% not only for the Parity-4 benchmark presented here but also 
for the Parity-5 and Parity-6 problems using the same MLP architecture (N,8,2,1) [55]. 


Appendix 14.A: Different Steps of the FKF Algorithm 
for Training an MLP 


1. Initialization 
   • From layer s = 1 to L, set all $y_0^{[s-1]}$ to a value different from 0 (e.g., 0.5). 
   • Initialize all the weights $w_{ji}^{[s]}$ at random values between ±0.5. 
   • Initialize the matrix inverse $\left(E_s^{a}(0)\right)^{-1}$. 
   • Initialize the forgetting factor λ. 
2. Select training pattern 
   • Select an input/output pair to be processed into the network. 
   • The input vector is $y^{[0]}$ and the corresponding output is $d^{[L]}$. 
3. Run the selected pattern through the network for each layer s from 1 to L and calculate the 
   summation output: 

$$u_j^{[s]} = \sum_{i=0}^{n_{s-1}} w_{ji}^{[s]}\,y_i^{[s-1]}$$

   and the nonlinear output 

$$y_j^{[s]} = f(u_j^{[s]}) = \mathrm{sigmoid}(u_j^{[s]})$$

4. FKF procedure 
   For each layer s from 1 to L, compute the following quantities: 
   • Forward prediction error 

$$e_s^{a}(k) = y_{s-1}(k) - A_s^T(k-1)\,y_{s-1}(k-1)$$

   • Forward prediction matrix 

$$A_s(k) = A_s(k-1) + G_s(k-1)\left(e_s^{a}(k)\right)^T$$

   • A posteriori forward prediction error 

$$\varepsilon_s^{a}(k) = y_{s-1}(k) - A_s^T(k)\,y_{s-1}(k-1)$$

   • Energy matrix 

$$E_s^{a}(k) = \lambda E_s^{a}(k-1) + e_s^{a}(k)\left(\varepsilon_s^{a}(k)\right)^T$$

   • Augmented Kalman gain 

$$G_s^{(1)}(k) = \begin{bmatrix} 0 \\ G_s(k-1) \end{bmatrix} + \begin{bmatrix} I \\ -A_s(k) \end{bmatrix}\left(E_s^{a}(k)\right)^{-1}\varepsilon_s^{a}(k) = \begin{bmatrix} M_s(k) \\ m_s(k) \end{bmatrix}$$

   • Backward prediction error 

$$e_s^{b}(k) = y_{s-1}(k-1) - B_s^T(k-1)\,y_{s-1}(k)$$

   • Kalman gain 

$$G_s(k) = \frac{1}{1 - \left(e_s^{b}(k)\right)^T m_s(k)}\left[M_s(k) + B_s(k-1)\,m_s(k)\right]$$

   • Backward prediction matrix 

$$B_s(k) = B_s(k-1) + G_s(k)\left(e_s^{b}(k)\right)^T$$

5. Backpropagate the signal error 
   • Compute $f'(u_j^{[s]}) = f(u_j^{[s]})\left(1 - f(u_j^{[s]})\right)$ 
   • Compute the error signal of the output layer (s = L): $e_j^{[L]} = f'(u_j^{[L]})\left(d_j^{[L]} - y_j^{[L]}\right)$ 
   • For each node j of the hidden layer, start from s = L − 1 down to s = 1 and calculate 

$$e_j^{[s]} = f'(u_j^{[s]})\sum_{i}\left(e_i^{[s+1]}\,w_{ij}^{[s+1]}\right)$$

6. Calculate the desired summation output at the Lth layer using the inverse of the sigmoid 

$$ld_j^{[L]} = f^{-1}(d_j^{[L]})$$

   for each neuron j. 
7. Calculate the weight vectors in the output layer L 

$$w_j^{[L]}(k) = w_j^{[L]}(k-1) + G_L(k)\left(d_j(k) - y_j^{[L]}(k)\right)$$

   for each neuron j. 
   For each hidden layer s = 1 through L − 1, the weight vectors are updated by 

$$w_j^{[s]}(k) = w_j^{[s]}(k-1) + G_s(k)\,e_j^{[s]}\,\mu$$

   for each neuron j. 
8. Test for ending the running 


© 2011 by Taylor and Francis Group, LLC 




Appendix 14.B: Different Steps of the EKF for Training an MLP 


For k = 1,2,... 
j= fWHK-1'9) 
Mk) = Mk-1) + wk) 
(aCk) = yK))"A(R) = WR) _ 44) 
ny 
Ji (k) = X"(k) 
For s = Lat 1: 


j 
yk) (L= y!""(k))[0,...,0,1,0,...,0]" 
A\(k) = fs 
FW(1- HW) Y wi"k-DA"&) 


j=l 
Wilk) = Pj (kD y*(k) 
065(k) = y i(k)! i(k) 


Bi(k)=As"A; 


Aj(k)! (d(k) — y5(k)) 
Mk) + 0) (k)Bi(K) 


W; (k) = Wj (k-1) + Wilk) 


S(k : 
Pi(k) = PK(k=1) ub. 8 Vivi 
IPI 


yjsalk) = yi(k) + AG(K)y (K)(y i(k) — yi(k-D) 


Appendix 14.C: Different Steps of the DFP Algorithm 
for Mathematical Programming 


1. Initialize the vector W(0) and a positive definite initial Hessian inverse matrix H(0). 
   Select a convergence threshold: ct 

2. Compute the descent direction $d_r(k)$ 

$$d_r(k) = -H(k)\,\nabla E(W(k))$$

3. Search the optimal value λ*(k) such that 

$$E(W(k) + \lambda^*(k)\,d_r(k)) = \min_{\lambda} E(W(k) + \lambda\,d_r(k))$$

4. Update W(k): 

$$W(k+1) = W(k) + \lambda^*(k)\,d_r(k)$$

5. Compute 

$$\delta(k) = W(k+1) - W(k)$$

$$\gamma(k) = \nabla E(W(k+1)) - \nabla E(W(k))$$

$$\Delta(k) = \frac{\delta(k)\,\delta^T(k)}{\delta^T(k)\,\gamma(k)} - \frac{H(k)\,\gamma(k)\,\gamma^T(k)\,H(k)}{\gamma^T(k)\,H(k)\,\gamma(k)}$$

6. Update the inverse matrix H(k): 

$$H(k+1) = H(k) + \Delta(k)$$

7. Compute the cost function value E(W(k)). 
   If E(W(k)) > ct, go to step 2. 
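
The steps above can be sketched on a small quadratic cost, for which the optimal step length has a closed form. This is an illustration under stated assumptions, not the chapter's training code: the cost E(w) = 0.5 w^T A w − b^T w, the matrix `A`, the vector `b`, and the function name `dfp_minimize` are hypothetical, and the stopping test uses the squared gradient norm instead of the cost value because the minimum cost is not known in advance.

```python
def dfp_minimize(A, b, w, n_iter=10, ct=1e-12):
    """Minimize E(w) = 0.5 w^T A w - b^T w with DFP and an exact line search."""
    n = len(w)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def grad(v):
        return [g - bi for g, bi in zip(matvec(A, v), b)]

    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # H(0) = I
    g = grad(w)
    for _ in range(n_iter):
        if dot(g, g) <= ct:                       # assumed stopping test
            break
        d = [-x for x in matvec(H, g)]            # step 2: descent direction
        lam = -dot(g, d) / dot(d, matvec(A, d))   # step 3: exact only for a quadratic
        w_new = [wi + lam * di for wi, di in zip(w, d)]   # step 4
        g_new = grad(w_new)
        delta = [p - q for p, q in zip(w_new, w)]         # step 5
        gamma = [p - q for p, q in zip(g_new, g)]
        Hg = matvec(H, gamma)
        dg, gHg = dot(delta, gamma), dot(gamma, Hg)
        # Step 6: H is symmetric, so H gamma gamma^T H is the outer product of Hg.
        H = [[H[i][j] + delta[i] * delta[j] / dg - Hg[i] * Hg[j] / gHg
              for j in range(n)] for i in range(n)]
        w, g = w_new, g_new
    return w, H


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w_opt, H_final = dfp_minimize(A, b, [0.0, 0.0])
print(w_opt)
print(H_final)
```

On this two-dimensional quadratic the iteration stops after two updates, and the final H coincides (up to rounding) with the inverse of A, which illustrates how the DFP update builds the inverse Hessian from gradient differences only.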


Appendix 14.D: Different Steps of Wolfe’s Line Search Algorithm 


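For a given descent direction $d_r(k)$, we will evaluate an optimal value λ* that satisfies the two Wolfe 
conditions. 


1. Set $\lambda_0 = 0$; choose $\lambda_1 > 0$ and $\lambda_{max}$; i = 1 
   Repeat 
2. Evaluate $E(w + \lambda_i d_r) = \Phi(\lambda_i)$ and check the 1st Wolfe condition: 
   If $\Phi(\lambda_i) > \Phi(0) + 10^{-4}\lambda_i\,\Phi'(0)\cdot d_r$ 
   Then $\lambda^* \leftarrow \mathrm{Zoom}(\lambda_{i-1}, \lambda_i)$ and stop 
3. Evaluate $\Phi'(\lambda_i)\cdot d_r$ and check the 2nd Wolfe condition: 
   If $|\Phi'(\lambda_i)\cdot d_r| \le 0.9\,|\Phi'(0)\cdot d_r|$ 
   Then $\lambda^* = \lambda_i$ and stop 
   If $\Phi'(\lambda_i)\cdot d_r \ge 0$ 
   Then $\lambda^* \leftarrow \mathrm{Zoom}(\lambda_i, \lambda_{i-1})$ and stop 
4. Choose $\lambda_{i+1} \in (\lambda_i, \lambda_{max})$ 
   i = i + 1 
   End (repeat) 


“Zoom” phase 
   Repeat 
1. Interpolate to find a trial step length $\lambda_j$ between $\lambda_{lo}$ and $\lambda_{hi}$ 
2. Evaluate $\Phi(\lambda_j)$ 
   If $[\Phi(\lambda_j) > \Phi(0) + 10^{-4}\lambda_j\,\Phi'(0)\cdot d_r]$ or $[\Phi(\lambda_j) > \Phi(\lambda_{lo})]$ 
   Then $\lambda_{hi} = \lambda_j$ 
        $\Phi(\lambda_{hi}) = \Phi(\lambda_j)$ 
   Else, compute $\Phi'(\lambda_j)\cdot d_r$ 
        If $|\Phi'(\lambda_j)\cdot d_r| \le 0.9\,|\Phi'(0)\cdot d_r|$ 
        Then $\lambda^* = \lambda_j$ and stop 
        If $\Phi'(\lambda_j)\cdot d_r\,(\lambda_{hi} - \lambda_{lo}) \ge 0$ 
        Then $\lambda_{hi} = \lambda_{lo}$ 
             $\Phi(\lambda_{hi}) = \Phi(\lambda_{lo})$ 
        $\lambda_{lo} = \lambda_j$ 
        $\Phi(\lambda_{lo}) = \Phi(\lambda_j)$ 
3. Evaluate $\Phi'(\lambda_j)$ 
   End (repeat) 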
Appendix 14.E: Different Steps of the DFP Algorithm for Training an MLP 


1. Initialize randomly the synaptic coefficients w(0) and the positive definite Hessian inverse 
   matrix H(0). 

2. Select an input/output pattern and compute 

$$u_j^{[L]} = \left(W_j^{[L]}\right)^T y^{[L-1]} \quad \text{and} \quad f(u_j^{[L]}) = y_j^{[L]}$$

3. Compute the error function: 

$$E(w) = \frac{1}{2}\sum_{j=1}^{n_L} e_j^2 = \frac{1}{2}\sum_{j=1}^{n_L}\left(d_j^{[L]} - y_j^{[L]}\right)^2$$

4. Evaluate $\nabla E(w) = \partial E(w)/\partial w$, where $w = (W^{[L-1]}, W^{[L]})$ 

5. Compute the descent direction $d_r(k)$ 

$$d_r(k) = -H(k)\,\nabla E(w)$$

6. Compute the optimal value λ*(k) that satisfies the two Wolfe conditions: 

$$E(w(k) + \lambda(k)\,d_r(k)) - E(w(k)) \le 10^{-4}\,\lambda(k)\,d_r^T(k)\,\nabla E(w(k))$$

$$d_r^T(k)\,\nabla E(w(k) + \lambda(k)\,d_r(k)) \ge 0.9\,d_r^T(k)\,\nabla E(w(k))$$

7. Update w(k) for the output layer: 

$$w(k+1) = w(k) + \lambda^*(k)\,d_r(k)$$

8. Compute: 

$$\delta(k) = w(k+1) - w(k)$$

$$\gamma(k) = \nabla E(w(k+1)) - \nabla E(w(k))$$

$$\Delta(k) = \frac{\delta(k)\,\delta^T(k)}{\delta^T(k)\,\gamma(k)} - \frac{H(k)\,\gamma(k)\,\gamma^T(k)\,H(k)}{\gamma^T(k)\,H(k)\,\gamma(k)}$$

9. Update the inverse matrix H(k): 

$$H(k+1) = H(k) + \Delta(k)$$

10. Compute the global error: $E = \sum_{p} E_p$ 

    If E > threshold, then return to 2. 
15.1 Introduction

Since the 1990s, the feedforward neural network (FNN) has been widely used to model and simulate complex nonlinear problems, including cases where the mathematical relationship between the input and output data of an application is unknown. Implementing an FNN requires the steps below to ensure a proper representation of the function to be approximated:


Step 1: Identification of the structure of the multilayer neural network: 

First, we must choose the network architecture, i.e., the number of layers and the number of neurons per layer. To date, this choice remains arbitrary and intuitive, which makes the identification of the structure a fundamental problem with a large effect on the remaining steps. The work of Funahashi (1989) and Cybenko (1989) shows that any continuous function can be approximated by an FNN with three layers, using a sigmoid activation function for neurons in the hidden layer and linear activation functions for neurons in the output layer. This result guarantees that a suitable network exists, but it does not specify the number of neurons in the hidden layer. The number of neurons in the input and output layers is fixed by the number of inputs and outputs, respectively, of the system to be modeled. The power of a neural network strongly depends on the architecture used [WY10]. For example, with 10 neurons in the popular MLP architecture with one hidden layer, only the Parity-9 problem can be solved [WHM03,W10]. However, if the FCC (fully connected) architecture is used, then a problem as large as Parity-1023 can be solved with the same 10 neurons. Generally, if connections across layers are allowed, neural networks become more powerful [W09]. The choice of the number of neurons per layer is another important problem. Moreover, the initial synaptic coefficients are chosen arbitrarily, so the learning phase must be effective and convergent.

© 2011 by Taylor and Francis Group, LLC


Step 2: Learning phase: 

Second, the learning phase consists of updating the parameters of the nonlinear regression by minimizing a cost function applied to all the data, so that the network achieves the desired function. There are many learning methods, which depend on several factors including the choice of cost function, the initialization of the weights, and the criterion for stopping the learning. (For further details see the related chapters in this handbook.)


Step 3: Generalization phase or testing phase: 

Finally, we must test the quality of the network obtained by presenting examples not used in the learning phase. To do so, it is necessary to divide the available data into a set of learning patterns and a set of generalization patterns. At this stage, we decide whether the neural network obtained is capable of achieving the desired function within an acceptable error; if it is not, we need to repeat steps 1, 2, and 3 after changing one (or more) of the following:

• The structure
• The initial synaptic coefficients
• The parameters of the learning phase: the learning algorithm, stopping criteria, etc.


In general, the advantages of a multilayer neural network lie in its adjustable synaptic coefficients and in the backpropagation learning algorithm, which is trained on the data. The effectiveness of the latter depends on the complexity of the structure of the neural network used. This makes the structural identification stage, that is, "optimizing the number of layers necessary in a multilayer neural network as well as the number of neurons per layer, in order to improve its performance," an essential step for guaranteeing the best training and generalization. To this end, much research has been devoted to the problem of choosing the optimal structure for a multilayer neural network in terms of the number of layers and the number of neurons per layer, leading in particular to pruning and growing algorithms. In this chapter, we will focus on the pruning algorithms.


15.2 Definition of Pruning Algorithms 


Commencing with a multilayer neural network of large structure, the task for the pruning algorithm is 
to optimize the number of layers and the number of neurons needed to model the desired function or 
application. After pruning, the FNN retains the necessary number of layers, and numbers of neurons in 
each layer to implement the application. The resulting FNN is regarded as having an optimal structure. 


15.3 Review of the Literature 


An FNN should ideally be optimized to have a small and compact structure with good learning and generalization capabilities. Many researchers have proposed pruning algorithms to reduce the network size. These algorithms are mainly based on

• The iterative pruning algorithm [CF97]
• Statistical methods [CGGMM95]
• The combined statistical stepwise and iterative neural network pruning algorithm [FFJC09]
• Pruning heuristics using sensitivity analysis [E01]
• Nonlinear ARMAX models [FFN01]
• Adaptive training and pruning in feed-forward networks [WCSL01]
• Optimal brain damage [LDS90]
• Pruning algorithm based on genetic algorithm [MHF06]
• Pruning algorithm based on fuzzy logic [J00]

In this chapter, two existing, published techniques used in pruning algorithms are reviewed. A new 
pruning algorithm proposed by the authors in [FFJC09] will also be described and discussed. 


15.4 First Method: Iterative-Pruning (IP) Algorithm 


This method was proposed first by Castellano et al. [CF97] in 1997. The unnecessary neurons in a large 
FNN of arbitrary structure and parameters are removed, yielding a less complex neural network with a 
better performance. In what follows, we give some details to help the user to understand the procedure 
to implement the IP algorithm. 


15.4.1 Some Definitions and Notations 


An FNN can be represented by a graph N = (V, E, w), known as an "acyclic weighted directed graph" [CF97], with

$V = \{1, 2, \ldots, n\}$: the set of n neurons
$E \subseteq V \times V$: the set of connections between the neurons of V, where $(i, j) \in E$ denotes a connection directed from neuron i to neuron j
$w: E \to \mathbb{R}$: the function that associates a real value $w_{ij}$ with each connection $(i, j) \in E$

Each neuron $i \in V$ is associated with two specific sets:

• Its own "projective field": the set of neurons j fed by neuron i.

$$P_i = \{\, j \in V : (i, j) \in E \,\} \qquad (15.1)$$

• Its own "receptive field": the set of neurons j directed to neuron i.

$$R_i = \{\, j \in V : (j, i) \in E \,\} \qquad (15.2)$$

We denote by $p_i$ and $r_i$ the cardinalities of the sets $P_i$ and $R_i$, respectively.
In the case of a multilayer neural network, the projective and receptive sets of any neuron i belonging to layer l are simply the sets of neurons in layers (l + 1) and (l - 1), respectively.
The set of neurons V can be divided into three subsets:

• $V_I$: the set of input neurons of the neural network
• $V_O$: the set of output neurons of the neural network
• $V_H$: the set of hidden neurons of the neural network
A multilayer neural network works as follows: each input neuron receives information from the external environment as a pattern and passes it to the neurons belonging to its projective set. Similarly, each neuron $i \in V_H \cup V_O$ receives the outputs of the neurons in its own receptive set, and its input is calculated as follows:

$$u_i = \sum_{j \in R_i} w_{ji}\, y_j \qquad (15.3)$$

where
$u_i$ is the linear output of neuron i
$y_j$ represents the nonlinear output of neuron j

FIGURE 15.1 An example of a multilayer neural network: $V_I = \{1, 2, 3, 4\}$, $V_H = \{5, 6, 7\}$. The own projective set of neuron 7 is $P_7 = \{8, 9\}$.

This neuron then sends the nonlinear output signal $y_i$ to all the neurons belonging to its own projective set $P_i$:

$$y_i = f(u_i) \qquad (15.4)$$

where f(.) is the activation function of neuron i.

This procedure continues until the output neurons produce the response corresponding to the input example.

Figure 15.1 shows an example of a multilayer neural network and illustrates the notations introduced above.
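As a small illustration of the notation, the projective and receptive fields can be built directly from a list of directed connections. The sketch below is hypothetical: the helper name `fields` is not from the chapter, each pair (i, j) is read as a connection from neuron i to neuron j, and the network of Figure 15.1 is assumed to be fully connected between consecutive layers with output neurons 8 and 9.

```python
def fields(edges):
    """Build the projective fields P and receptive fields R from directed edges (i, j)."""
    P, R = {}, {}
    for i, j in edges:
        P.setdefault(i, set()).add(j)   # neuron j is fed by neuron i
        R.setdefault(j, set()).add(i)   # neuron i feeds neuron j
    return P, R

# Assumed layout of Figure 15.1: inputs 1-4, hidden neurons 5-7, outputs 8-9.
inputs, hidden, outputs = (1, 2, 3, 4), (5, 6, 7), (8, 9)
edges = [(i, j) for i in inputs for j in hidden]
edges += [(i, j) for i in hidden for j in outputs]

P, R = fields(edges)
print(sorted(P[7]), sorted(R[8]))
```

Under these assumptions, the projective set of neuron 7 contains the two output neurons, in agreement with the caption of Figure 15.1.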


15.4.2 Formulation of the Pruning Problem 


The pruning procedure (IP) consists of a series of steps, each eliminating one neuron from the hidden layer of the FNN. Eliminating a neuron directly implies eliminating all connections associated with it. The main question is therefore how to choose the neuron to be eliminated.

Suppose that a neuron h has been chosen for elimination according to a prespecified criterion, which is detailed in Section 15.4.3. Consider the new set of neural network connections:

$$E_{new} = E_{old} - \big(\{h\} \times P_h \,\cup\, R_h \times \{h\}\big) \qquad (15.5)$$

with $E_{old}$ the set of connections before the elimination of neuron h.
This elimination is followed directly by an adjustment of all the connections arriving at each neuron of the projective field $P_h$, in order to maintain (or improve) the performance that the initial neural network reached during the learning phase.

Consider a neuron $i \in P_h$. The corresponding linear output of this neuron for a pattern $p \in \{1, \ldots, M\}$ is given by

$$u_i^{(p)} = \sum_{j \in R_i} w_{ji}\, y_j^{(p)} \qquad (15.6)$$

where $y_j^{(p)}$ is the nonlinear output of neuron j corresponding to the pattern p.

After removing the neuron h, one has to update the remaining synaptic coefficients $w_{ji}$ in the same layer by a specified value $\delta_{ji}$ in order to maintain the initial performance of the neural network. This idea is illustrated in Figure 15.2.

The linear summation at the output of the pruned layer becomes

$$\sum_{j \in R_i} w_{ji}\, y_j^{(p)} = \sum_{j \in R_i - \{h\}} (w_{ji} + \delta_{ji})\, y_j^{(p)} \qquad (15.7)$$

with
$p = 1, \ldots, M$ and $i \in P_h$
$\delta_{ji}$ the adjustment factor to be determined later

A simple mathematical development gives

$$\sum_{j \in R_i - \{h\}} \delta_{ji}\, y_j^{(p)} = w_{hi}\, y_h^{(p)} \qquad (15.8)$$


FIGURE 15.2 The procedure of the IP algorithm. The neuron h was chosen for elimination. All connections associated with this neuron (labeled $w_{h7}$ and $w_{h8}$ in the figure) are eliminated and all related connections into $P_h$ are adjusted with the parameter $\delta = [\delta_{57}\ \delta_{58}\ \delta_{67}\ \delta_{68}]$.
This leads to a set of $M p_h$ linear equations with $k_h = \sum_{i \in P_h} (r_i - 1)$ unknown parameters $\{\delta_{ji}\}$. Note that $k_h$ is the total number of connections arriving at the set $P_h$ once h has been removed. It is convenient to write these equations in matrix form. Thus, consider the following vectors:

• For each neuron i, the M-vector $y_i$ contains the output values of neuron i for each example p:

$$y_i = \big(y_i^{(1)}, \ldots, y_i^{(M)}\big)^T \qquad (15.9)$$

• Define $Y_{i,h}$: the $M \times (r_i - 1)$ matrix whose columns are the output vectors of the neurons $j \in R_i - \{h\}$:

$$Y_{i,h} = \big[\, y_{j_1}\ y_{j_2}\ \ldots\ y_{j_{r_i - 1}} \big] \qquad (15.10)$$

with the indices $j_k$, for $k = 1, \ldots, (r_i - 1)$, belonging to $R_i - \{h\}$.

The system can now be written in matrix form:

$$Y_{i,h}\,\delta_i = z_{i,h} \qquad (15.11)$$

for each $i \in P_h$, with $\delta_i$ the vector of unknown parameters and

$$z_{i,h} = w_{hi}\, y_h \qquad (15.12)$$

Finally, gathering all these sets of equations, one gets

$$Y_h\,\delta = z_h \qquad (15.13)$$

with

$$Y_h = \mathrm{diag}\big(Y_{i_1,h}, Y_{i_2,h}, \ldots, Y_{i_k,h}, \ldots, Y_{i_{p_h},h}\big) \qquad (15.14)$$

$$\delta = \big(\delta_{i_1}^T, \delta_{i_2}^T, \ldots, \delta_{i_k}^T, \ldots, \delta_{i_{p_h}}^T\big)^T \qquad (15.15)$$

$$z_h = \big(z_{i_1,h}^T, z_{i_2,h}^T, \ldots, z_{i_k,h}^T, \ldots, z_{i_{p_h},h}^T\big)^T \qquad (15.16)$$

Here, the indices $i_k$ ($k = 1, \ldots, p_h$) range over $P_h$.
This set of equations is solved in the least-squares sense by minimizing the following criterion:

$$\text{Minimize } \|z_h - Y_h\,\delta\|_2 \qquad (15.17)$$

Several methods can be used to solve this optimization problem. These include the conjugate gradient preconditioned normal equation (CGPCNE) algorithm (see Appendix 15.A).
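The least-squares adjustment of Equation 15.17 can be illustrated with NumPy. The sketch below is a simplification under stated assumptions and is not the procedure of [CF97]: it uses `numpy.linalg.lstsq` in place of CGPCNE, it assumes a single fully connected hidden-to-output layer (so every output neuron has the same receptive field), and the function name `remove_hidden_unit` and the array layout are hypothetical.

```python
import numpy as np

def remove_hidden_unit(W, Y_hidden, h):
    """Remove hidden unit h and adjust the remaining weights in the least-squares sense.

    W        : (n_hidden, n_out) array, W[j, i] is the weight from hidden j to output i
    Y_hidden : (M, n_hidden) array, column j holds the outputs of hidden neuron j
    Returns the (n_hidden - 1, n_out) adjusted weight array.
    """
    keep = [j for j in range(W.shape[0]) if j != h]
    Y_keep = Y_hidden[:, keep]                  # plays the role of Y_{i,h}
    Z = np.outer(Y_hidden[:, h], W[h])          # column i is z_{i,h} = w_hi * y_h
    delta, *_ = np.linalg.lstsq(Y_keep, Z, rcond=None)
    return W[keep] + delta                      # w_ji + delta_ji

# Hypothetical data: the last hidden neuron is redundant by construction.
rng = np.random.default_rng(0)
Y_hidden = rng.normal(size=(30, 4))
Y_hidden[:, 3] = 0.5 * Y_hidden[:, 0] - 2.0 * Y_hidden[:, 1]
W = rng.normal(size=(4, 2))

W_new = remove_hidden_unit(W, Y_hidden, h=3)
before = Y_hidden @ W
after = Y_hidden[:, :3] @ W_new
print(np.max(np.abs(before - after)))
```

Because the removed neuron is an exact linear combination of the others in this toy example, the linear outputs of the layer are preserved up to rounding; in general only the residual of Equation 15.17 is minimized.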
15.4.3 How to Choose the Neuron to Be Removed?

The adopted strategy is to associate each neuron in the hidden layer with an indicator giving its contribution to the neural network. The neuron h is chosen so that, after solving Equation 15.17, its removal has a minimum effect on the outputs of the neural network.

The CGPCNE method aims to reduce the following term at each iteration k:

$$\rho_h(\delta_k) = \|z_h - Y_h\,\delta_k\|_2 \qquad (15.18)$$

where $\|\cdot\|_2$ is the Euclidean norm.
But the choice of the neuron h is made before applying the CGPCNE algorithm. So, this choice is made at iteration k = 0:

$$\rho_h(\delta_0) = \|z_h - Y_h\,\delta_0\|_2 \qquad (15.19)$$

Therefore, the neuron h can be chosen according to the following criterion:

$$h = \arg\min_{h \in V_H} \rho_h(\delta_0) \qquad (15.20)$$

where $V_H$ is the set of hidden neurons in the neural network.
In general, the initial value $\delta_0$ is set to zero. The criterion may then be written as

$$h = \arg\min_{h \in V_H} \sum_{i \in P_h} w_{hi}^2\, \|y_h\|_2^2 \qquad (15.21)$$
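The step from Equation 15.20 to Equation 15.21 can be made explicit with a short derivation. It uses only the definitions in Equations 15.12, 15.16, and 15.19; squaring the norm is harmless because it does not change the minimizer.

```latex
\rho_h(0)^2
  = \| z_h \|_2^2
  = \sum_{i \in P_h} \| z_{i,h} \|_2^2
  = \sum_{i \in P_h} \| w_{hi}\, y_h \|_2^2
  = \sum_{i \in P_h} w_{hi}^2 \, \| y_h \|_2^2
  = \| y_h \|_2^2 \sum_{i \in P_h} w_{hi}^2
```

The last form shows that a hidden neuron scores low either when its outputs are small over all patterns or when its outgoing weights are small.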


A summary of the iterative-pruning (IP) algorithm is given in Table 15.1. 


TABLE 15.1 Summary of the IP Algorithm

Step 0: Choose an oversized NN(k = 0), then apply the standard back propagation (SBP) training algorithm
Repeat
Step 1: Identify the excess unit h of network NN(k):

$$h = \arg\min_{h \in V_H} \sum_{i \in P_h} w_{hi}^2 \left[ \big(y_h^{(1)}\big)^2 + \cdots + \big(y_h^{(\mu)}\big)^2 + \cdots + \big(y_h^{(M)}\big)^2 \right]$$

where
$P_h$ represents the set of units that are fed by unit h (called projective field)
$w_{hi}$ represents the weight of the connection from neuron h to neuron i
$y_h^{(\mu)}$ represents the output of unit h corresponding to pattern $\mu \in \{1, \ldots, M\}$

Step 2: Apply the CGPCNE algorithm to determine $\delta$ (refer to Appendix 15.A)

Step 3: Remove the unit h with all its incoming and outgoing connections and build a new network NN(k + 1) as follows:

$$w_{ji}(k+1) = \begin{cases} w_{ji}(k) + \delta_{ji} & \text{if } i \in P_h \\ w_{ji}(k) & \text{if } i \notin P_h \end{cases}$$

Step 4: k := k + 1
Until the performance of NN(k) deteriorates
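Step 1 of Table 15.1 amounts to one score per hidden neuron. The following sketch computes that score with NumPy; it is illustrative only, assumes a single fully connected hidden-to-output layer, and the function name `select_unit` and the sample arrays are hypothetical.

```python
import numpy as np

def select_unit(W, Y_hidden):
    """Return the index of the hidden unit with the smallest IP score, and all scores.

    W        : (n_hidden, n_out) array, W[h, i] is the weight from hidden h to output i
    Y_hidden : (M, n_hidden) array, column h holds the outputs of hidden neuron h
    """
    weight_energy = (W ** 2).sum(axis=1)          # sum over i in P_h of w_hi^2
    output_energy = (Y_hidden ** 2).sum(axis=0)   # squared norm of y_h over the M patterns
    scores = weight_energy * output_energy
    return int(np.argmin(scores)), scores

# Hypothetical example: the second unit has very small outgoing weights.
W = np.array([[1.0, 1.0],
              [0.1, 0.1],
              [2.0, 0.0]])
Y_hidden = np.ones((3, 3))
h, scores = select_unit(W, Y_hidden)
print(h, scores)
```

The score is cheap to evaluate because it needs no linear solve; the CGPCNE step is run only once, for the selected unit.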
15.5 Second Method: Statistical Stepwise Method (SSM) Algorithm

Unlike the previous approach, Cottrell et al. [CGGMM95] proposed an algorithm that identifies the synaptic coefficients that must be removed to ensure that the neural network performs better. The idea is based on statistical criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).


15.5.1 Some Definitions and Notations 


A neural network can be defined by its set of nonzero synaptic coefficients. Therefore, we can associate each neural network (NN) model with its set of synaptic coefficients. In general, the quality of the neural network is evaluated by calculating its performance, for instance the quadratic error S(NN):

$$S(NN) = \sum_{p=1}^{M} \big(Y_p - \hat{Y}_p\big)^2 \qquad (15.22)$$

where
NN indicates the set of all synaptic coefficients of the initial starting neural network, i.e., $NN = \{w_{ij}\}$
$Y_p$ and $\hat{Y}_p$ are, respectively, the desired output and the current actual output of the neural network for an input example p

This criterion may be sufficient if one is interested only in the examples used during the learning phase. Otherwise, one can use the following information criteria, which also penalize the number of coefficients:

• AIC:

$$AIC = \ln\frac{S(NN)}{M} + \frac{2m}{M} \qquad (15.23)$$

• BIC:

$$BIC = \ln\frac{S(NN)}{M} + \frac{m \log M}{M} \qquad (15.24)$$

where
m is the total number of synaptic coefficients in the neural network
M is the total number of examples used in the learning phase

These criteria give good results when M is large and tends to infinity.
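The two criteria are straightforward to evaluate once the residual sum of squares is known. The sketch below assumes the forms AIC = ln(S/M) + 2m/M and BIC = ln(S/M) + m ln(M)/M, which is one reading of Equations 15.23 and 15.24; the function names and the sample numbers are hypothetical.

```python
import math

def aic(residual_ss, n_patterns, n_weights):
    """AIC under the assumed form ln(S/M) + 2m/M."""
    return math.log(residual_ss / n_patterns) + 2.0 * n_weights / n_patterns

def bic(residual_ss, n_patterns, n_weights):
    """BIC under the assumed form ln(S/M) + m ln(M)/M."""
    return math.log(residual_ss / n_patterns) + n_weights * math.log(n_patterns) / n_patterns

# Hypothetical comparison: a large network against a pruned one on M = 100 patterns.
large = dict(residual_ss=50.0, n_patterns=100, n_weights=40)
pruned = dict(residual_ss=55.0, n_patterns=100, n_weights=15)
print(aic(**large), aic(**pruned))
print(bic(**large), bic(**pruned))
```

In this toy comparison the pruned network has a slightly larger residual but a lower value of both criteria, because the penalty term grows with the number of weights.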


15.5.2 General Idea 


The statistical stepwise method proposed by Cottrell et al. successively eliminates the synaptic coefficients $w_{l_1}, w_{l_2}, \ldots, w_{l_k}$, which is equivalent to the successive implementation of the following neural networks:

$NN_{l_1}$ (with $w_{l_1} = 0$), $NN_{l_1 l_2}$ (with $w_{l_1} = w_{l_2} = 0$), and so on.


Figure 15.3 shows an example of removing synaptic connection in a couple of iterations. 
[Figure 15.3 panels: starting structure of a feedforward neural network NN; iteration 1, the NN after removing one synaptic connection; iteration 2, the NN after removing two synaptic connections.]

FIGURE 15.3 Example of an initial NN and the removal process using the SSM algorithm over two iterations.


15.5.3 Summary of the Steps in the SSM Algorithm 


After training an initially large neural network NN, one can apply the following steps to eliminate the 
insignificant synaptic coefficients in order to ensure better performance (Table 15.2). 


15.6 Third Method: Combined Statistical Stepwise and 
Iterative Neural Network Pruning (SSIP) Algorithm [FFJC09] 


The IP and SSM algorithms are combined to remove both unnecessary neurons and unnecessary weight connections from a given FNN in order to "optimize" its structure. To this end, some modifications are made to the pruning algorithms published in [CF97] and [CGGMM95]. In the original IP and SSM algorithms, the stop criterion halts the removal process at the first deterioration in the generalization step. However, because of the complexity and nonlinear character of the neural network, continuing the pruning beyond the first deterioration may produce another pruned network that gives a better performance. Consequently, the pruning is extended beyond the first deterioration to look for any network that yields a better generalization. With this change, the modified IP and modified SSM algorithms are defined as follows:

1. Modified IP algorithm
The modified IP algorithm for the elimination of one neuron is

• Apply steps 1, 2, and 3 of the IP algorithm for the elimination of one neuron.
• Compute the performance of the new NN after removal of the neuron.
• If the performance of the new NN is better than that of NN_optimal(k), save the new NN as the new optimal structure.
• Decrease the number of neurons.
TABLE 15.2 Summary of the SSM Algorithm

Step 0: Choose an oversized NN, then apply the SBP training algorithm.
Repeat
Step 1: For each weight connection $w_l = w_{ij}$, compute the following ratio:

$$Q_l = \sqrt{\frac{2 M S_l}{S(NN)/(M - m)}}$$

where
$S_l = \frac{1}{2M}\big(S(NN_l) - S(NN)\big)$ is the saliency of weight $w_l$, i.e., the increase of the residual error resulting from the elimination of $w_l$
$S(NN) = \sum_{p=1}^{M} \big(Y_p - \hat{Y}_{NN}\big)^2$ is the sum of squared residuals of NN
$S(NN_l) = \sum_{p=1}^{M} \big(Y_p - \hat{Y}_{NN_l}\big)^2$ is the sum of squared residuals of $NN_l$ (without the weight $w_l$)
M is the number of training patterns
$\hat{Y}_{NN}$ is the output of the NN model
m is the total number of remaining weights (before the elimination of $w_l$)
$Y_p$ is the desired output
$\hat{Y}_{NN_l}$ is the $NN_l$ output with weight $w_l$ removed

Step 2: Determine $l_{min}$ corresponding to the minimum value of these ratios:

$$l_{min} = \arg\min_l Q_l = \arg\min_l \sqrt{\frac{2 M S_l}{S(NN)/(M - m)}}$$

Step 3: If $Q_{l_{min}} < \tau$ (with $1 < \tau < 1.96$ [CF97]), one can
• Eliminate the weight $w_{l_{min}}$ corresponding to $l_{min}$
• Retrain the new NN = $NN_{l_{min}}$ with SBP
Else, do not eliminate the weight $w_{l_{min}}$.

Until the performance of the NN deteriorates or no insignificant weights remain
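Steps 1 to 3 of Table 15.2 can be sketched as follows, assuming the residual sums of squares with and without each weight are already available. The function names, the dictionary of candidate weights, and the threshold value 1.5 are hypothetical; the saliency is clamped at zero as a guard in case removing a weight happens to lower the residual.

```python
import math

def q_ratio(s_full, s_without, n_patterns, n_weights):
    """Ratio Q_l of Table 15.2 for one candidate weight."""
    saliency = max(s_without - s_full, 0.0) / (2.0 * n_patterns)   # S_l, clamped at zero
    noise = s_full / (n_patterns - n_weights)                      # S(NN) / (M - m)
    return math.sqrt(2.0 * n_patterns * saliency / noise)

def weakest_weight(s_full, s_without_by_weight, n_patterns, n_weights, tau):
    """Return (label, Q, prune?) for the weight with the smallest ratio."""
    ratios = {label: q_ratio(s_full, s, n_patterns, n_weights)
              for label, s in s_without_by_weight.items()}
    label = min(ratios, key=ratios.get)
    return label, ratios[label], ratios[label] < tau

# Hypothetical residuals: removing w24 barely changes the fit, removing w13 hurts more.
label, q, prune = weakest_weight(
    s_full=10.0,
    s_without_by_weight={"w13": 10.5, "w24": 10.02},
    n_patterns=110,
    n_weights=10,
    tau=1.5,
)
print(label, q, prune)
```

In practice each candidate residual requires a forward pass over the training set with one weight set to zero, which dominates the cost of the method.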


2. Modified SSM algorithm
The modified SSM algorithm for the elimination of one weight connection is

• Apply steps 1, 2, and 3 of the SSM algorithm for the elimination of one weight.
• If the weight is to be eliminated
  a. Compute the performance of the new NN after weight removal.
  b. If the performance of the new NN is better than that of NN_optimal(k), save the new NN as the new optimal structure.
  c. Decrease the number of weights.

With these modifications, the pruning algorithm SSIP is based on the idea of successive elimination of unnecessary neurons and of insignificant weights. It may be regarded as a combination of the modified IP algorithm (to remove an excess neuron) and the modified statistical stepwise algorithm (to prune insignificant links). Two versions of the new pruning algorithm SSIP, namely SSIP1 and SSIP2, are proposed.
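The common idea behind both modified algorithms, pruning past the first deterioration while remembering the best structure seen so far, can be sketched generically. Everything in this example is hypothetical scaffolding: the callbacks `prune_once`, `performance`, and `is_minimal` stand in for one IP or SSM elimination step, a generalization test, and the stopping rule, and the network is represented only by its size.

```python
def prune_keep_best(network, prune_once, performance, is_minimal):
    """Prune until the minimal structure is reached and return the best network seen."""
    best, best_score = network, performance(network)
    current = network
    while not is_minimal(current):
        current = prune_once(current)          # one neuron or one weight removed
        score = performance(current)
        if score > best_score:                 # save the new optimal structure
            best, best_score = current, score
    return best, best_score

# Toy illustration: the network is just its neuron count, with made-up scores.
toy_scores = {5: 0.80, 4: 0.78, 3: 0.85, 2: 0.70, 1: 0.60}
best, best_score = prune_keep_best(
    5,
    prune_once=lambda n: n - 1,
    performance=lambda n: toy_scores[n],
    is_minimal=lambda n: n == 1,
)
print(best, best_score)
```

In the toy run the performance drops at the first removal, yet a later structure scores higher than the starting network, which is exactly the case a stop-at-first-deterioration rule would miss.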
TABLE 15.3 First Version of the New Pruning Algorithm: SSIP1

Step 0: Choose an oversized NN, then apply the SBP training algorithm.
Step 1:
Comments: Apply the modified IP algorithm (removal of neurons) to the whole NN
Repeat
Apply the modified IP algorithm to remove one insignificant neuron from the whole NN
Until
One neuron remains in each layer
Conclusion: We have NN(IP) = NN_optimal(k) after removing the unnecessary neurons from NN
Step 2:
Comments: Apply the modified SSM algorithm (removal of weights) to the whole NN(IP)
Repeat
Apply the modified SSM algorithm to eliminate one insignificant weight from the whole NN(IP), the latest optimal structure
Until
There are no more insignificant weights to be removed or only one weight remains between any two layers

Conclusion: The final optimal structure obtained is labeled NN_optimal = NN(SSIP1)


15.6.1 First Version: SSIP1

The modified IP algorithm is first applied to remove insignificant neurons, and then the modified SSM algorithm is applied to remove any insignificant weights.
This version, SSIP1, is given in Table 15.3.

15.6.2 Second Version: SSIP2

The two modified algorithms are applied separately to each layer, while retaining the previously optimized structure in each previous layer. The general algorithmic steps are given in Table 15.4.

TABLE 15.4 Second Version of the New Pruning Algorithm: SSIP2


Step 0: Choose an oversized NN(L layers), then apply the SBP training algorithm 
Step 1: Start from the input layer 
Step 2: 
Comments: Apply the modified IP algorithm (removal of neuron) only to this layer 
Repeat 
Apply the modified IP algorithm to remove only one neuron from this layer 
Until 
All insignificant nodes have been removed or there is only one neuron in this layer 
Comments: Apply the modified SSM algorithm (removal of weight) only to this layer 
Repeat 
Apply the modified SSM algorithm to eliminate one insignificant weight from this layer 
Until 
No insignificant weights are left or there is only one weight in this layer 
Step 3: Go on to the following layer (2, 3 ...) and repeat step 2 until layer = L - 1 
Step 4: 
Comments: Apply the modified SSM algorithm (removal of weight) only to the output layer 
Repeat 
Apply the modified SSM algorithm to eliminate one insignificant weight from this layer 
Until 
There are no remaining insignificant weights or there is only one weight in this layer 


Conclusion: The final optimal structure obtained is labeled NN_optimal = NN(SSIP2)
15.7 Comments

All of the algorithms above, SSM, IP, and SSIP1,2, have been applied in real-world applications. In [FF02], the authors applied the SSM algorithm to the circle-in-the-square problem, while in [M93] another pruning method was proposed. In [FAN01,SJF96], the authors applied SSM, IP, and SSIP1,2 to medical brain disease classification problems. In [P97], the author presents another pruning method, close to the SSM algorithm, for pruning connections between neurons.


15.8 Simulations and Interpretations 


To test the effectiveness of the two versions of the SSIP1,2 algorithm, as compared to the SSM and IP algorithms, three parameters are used to decide whether or not the pruned network is suitable, namely:

• The complexity: the total number of links necessary for modeling each application.
• The sensitivity of learning: the percentage of the patterns used during learning that are perfectly trained.
• The sensitivity of generalization: the same measure for the patterns that were not used during the learning phase.

Two applications are selected to assess the performance of the proposed method:

1. Brain disease detection [SJF96]: An NN is used to differentiate between schizophrenic (SH), Parkinson's disease (PD), and Huntington's disease (HD) patients and normal control (NS) subjects. Each individual is characterized by 17 different variables.

2. Texture classification [FSN98]: The NN is used to classify images that are randomly chosen from eight initial textures. The least mean square filter coefficients of each texture image are used to form the input vector (8 variables) to the neural network.

For the two problems, several structures (1 input layer, 2 hidden layers, 1 output layer) with different initial conditions are examined by the IP, SSM, and SSIP1,2 algorithms in order to remove the insignificant parameters (units or weight connections) from each structure. A statistical study has been performed to evaluate the two new versions, SSIP1 and SSIP2: 100 realizations with different initial weighting coefficients have been used and tested. Tables 15.5 and 15.6 report the average performance of each pruning algorithm.


TABLE 15.5 Summary of Results of Texture Classification

Columns: Units = percentage of pruned units (%); L1, L2, L3 = percentage of pruned links in layers 1, 2, and 3 (%); Links = complexity (number of links); Learning = sensitivity of learning (%); Generalization = sensitivity of generalization (%).

Algorithm               Units       L1          L2          L3          Links       Learning    Generalization
NN without pruning      -           -           -           -           163         79.64 ± 10  79.37 ± 7
NN with IP [CF97]       26.29 ± 11  -           -           -           114 ± 35    78.74 ± 8   77.79 ± 5
NN with SSM [CGGMM95]   -           65.87 ± 21  55.47 ± 18  28.24 ± 13  75 ± 25     87.62 ± 3   81.06 ± 4
NN with SSIP [FFJC09]   26.90 ± 7   25 ± 10     24.72 ± 12  62 ± 4      90.25 ± 10  91.63 ± 6   86.75 ± 4

TABLE 15.6 Summary of Results of Detection of Brain Diseases

Columns as in Table 15.5.

Algorithm               Units       L1          L2          L3          Links       Learning    Generalization
NN without pruning      -           -           -           -           298.5       47.30 ± 20  52.77 ± 10
NN with IP [CF97]       68.55 ± 15  -           -           -           68.5 ± 15   43.16 ± 12  50 ± 14
NN with SSM [CGGMM95]   -           75.36 ± 5   79.27 ± 12  4.80 ± 3    114 ± 20    52.95 ± 25  43.15 ± 18
NN with SSIP [FFJC09]   39.21 ± 18  65.14 ± 12  44.28 ± 7   22.18 ± 5   77.5 ± 12   77.15 ± 15  76.39 ± 12


The first thing that can be seen from these tables is that the proposed pruning algorithms, SSIP1,2, indeed offer good learning and generalization capabilities:

• The SSIP algorithm eliminates around 59% of the links while improving the sensitivity of learning by about +39% and the sensitivity of generalization by about +26%.
• The SSM algorithm eliminates about 57.5% of the links, which corresponds to a change of about +10.95% in the sensitivity of learning and about -8% in the sensitivity of generalization.
• The IP algorithm eliminates around 53.5% of the links, which corresponds to a degradation of about 4.94% in the sensitivity of learning and about 3.64% in the sensitivity of generalization.

These results highlight the superiority of SSIP1,2 over SSM and IP used separately.
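The percentages quoted above can be cross-checked against the two tables. The sketch below rests on an assumption that the text does not state explicitly: each figure is taken to be a relative change with respect to the unpruned network, averaged over the two applications. The numbers are the mean values as read from Tables 15.5 and 15.6, and the function name is hypothetical.

```python
# (number of links, learning sensitivity %, generalization sensitivity %)
BASELINE = {"texture": (163.0, 79.64, 79.37), "brain": (298.5, 47.30, 52.77)}
PRUNED = {
    "IP":   {"texture": (114.0, 78.74, 77.79), "brain": (68.5, 43.16, 50.00)},
    "SSM":  {"texture": (75.0, 87.62, 81.06),  "brain": (114.0, 52.95, 43.15)},
    "SSIP": {"texture": (90.25, 91.63, 86.75), "brain": (77.5, 77.15, 76.39)},
}

def average_relative_change(algorithm):
    """Return (links removed %, learning change %, generalization change %)."""
    rows = []
    for task, (links0, learn0, gen0) in BASELINE.items():
        links, learn, gen = PRUNED[algorithm][task]
        rows.append((100.0 * (links0 - links) / links0,
                     100.0 * (learn - learn0) / learn0,
                     100.0 * (gen - gen0) / gen0))
    return tuple(sum(column) / len(rows) for column in zip(*rows))

for name in PRUNED:
    print(name, [round(value, 2) for value in average_relative_change(name)])
```

Under this reading the computed values agree with the quoted percentages to within about one percentage point, which supports the interpretation but does not prove it.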


15.9 Conclusions 


In this chapter, we have discussed several pruning approaches for obtaining a multilayer neural network that is "optimal" in terms of the number of neurons or synaptic coefficients per layer.

We have detailed three pruning algorithms, IP, SSM, and SSIP, from which one can understand the pruning procedure to be applied in practice to obtain an optimal neural network structure. From the simulation results of these algorithms, we can conclude the following:

• The effectiveness of any pruning algorithm depends not only on the efficiency of the pruning rules but also on (1) the manner of applying these rules and (2) the criteria used for stopping the pruning.
• The pruning results depend greatly on the initialization of the weighting coefficients of the starting NN.
• The user should not necessarily stop the algorithm at the first performance degradation, because this degradation may be misleading; the pruning iterations should be continued.

As a concluding remark, this chapter has presented some ideas for pruning NNs. The pruning problem remains an open area of research, and we believe that much remains to be done on this subject.
Appendix 15.A: Algorithm of CGPCNE (Conjugate Gradient Preconditioned Normal Equation)

This is the procedure used in the pruning algorithm (IP) proposed in [CF97] to solve the following problem:
Determine the parameter $\delta$ (as shown in Figure 15.A.1) that minimizes $\|z_h - Y_h\,\delta\|_2$, with

$$z_h = \big(z_{i_1,h}^T, z_{i_2,h}^T, \ldots, z_{i_k,h}^T, \ldots, z_{i_{p_h},h}^T\big)^T$$

$$z_{i,h} = w_{hi}\, y_h = w_{hi}\,\big(y_h^{(1)}, y_h^{(2)}, \ldots, y_h^{(M)}\big)^T$$

$$Y_h = \mathrm{diag}\big(Y_{i_1,h}, Y_{i_2,h}, \ldots, Y_{i_{p_h},h}\big)$$

$$Y_{i,h} = \big[\, y_{j_1}\ y_{j_2}\ \ldots\ y_{j_{r_i - 1}} \big]$$

where
h is the index of the neuron to be detected and removed
$\{j_1, j_2, \ldots, j_{r_i - 1}\}$ are the indices of the neurons that feed the neuron i (h not considered in this set)

Let us define:
D: a diagonal matrix whose nonzero elements are $(D)_{jj} = \|Y(:, j)\|_2^2$ (with the notation Y(:, j) indicating the jth column of the matrix Y)


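[Figure 15.A.1: the network before and after pruning; the pruned network shows adjusted weights such as $w_{67} + \delta_{67}$.]

FIGURE 15.A.1 The procedure for the IP algorithm.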
L: a strictly lower triangular matrix defined by $(L)_{jk} = Y(:, j)^T\, Y(:, k)$, $j > k$.

$C_\omega$: a preconditioning matrix calculated as $C_\omega = (D + \omega L)\,D^{-1/2}$, where $\omega$ is a relaxation parameter lying between 0 and 2.

The steps of the CGPCNE procedure are

1. Initialization: $r_0 := z_h - Y_h\,\delta_0$; $s_0 := C_\omega^{-1}\, Y_h^T\, r_0$; $p_0 := s_0$; $k := 0$

2. Repeat
   a. $q_k := Y_h\, C_\omega^{-T}\, p_k$
   b. $\alpha_k := \|s_k\|_2^2 \,/\, \|q_k\|_2^2$
   c. $r_{k+1} := r_k - \alpha_k\, q_k$
   d. $s_{k+1} := C_\omega^{-1}\, Y_h^T\, r_{k+1}$
   e. $\beta_k := \|s_{k+1}\|_2^2 \,/\, \|s_k\|_2^2$
   f. $p_{k+1} := s_{k+1} + \beta_k\, p_k$
   g. $\delta_{k+1} := \delta_k + \alpha_k\, C_\omega^{-T}\, p_k$
   h. $k := k + 1$
   until $\|\delta_k - \delta_{k-1}\|_2 < \varepsilon$

where $\varepsilon$ is a small predetermined constant.
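The iteration above can be written compactly with NumPy. This is a sketch under stated assumptions rather than a reference implementation: the function name `cgpcne` and its defaults are hypothetical, the starting point is taken as zero, the matrix is assumed to have full column rank with no zero columns, and the triangular systems are solved with the general-purpose `numpy.linalg.solve` for brevity.

```python
import numpy as np

def cgpcne(Y, z, omega=1.0, eps=1e-10, max_iter=200):
    """Approximately minimize ||z - Y d||_2 with the CGPCNE iteration."""
    G = Y.T @ Y
    diag = np.diag(G).copy()                       # (D)_jj = ||Y(:, j)||^2
    T = np.diag(diag) + omega * np.tril(G, k=-1)   # D + omega * L
    root = np.sqrt(diag)                           # diagonal of D^(1/2)

    def c_inv(v):                                  # C^-1 v = D^(1/2) (D + omega L)^-1 v
        return root * np.linalg.solve(T, v)

    def c_inv_t(v):                                # C^-T v = (D + omega L)^-T D^(1/2) v
        return np.linalg.solve(T.T, root * v)

    d = np.zeros(Y.shape[1])                       # assumed starting point: zero
    r = z - Y @ d
    s = c_inv(Y.T @ r)
    p = s.copy()
    for _ in range(max_iter):
        ss = s @ s
        if ss == 0.0:                              # already at the least-squares solution
            break
        direction = c_inv_t(p)
        q = Y @ direction                          # step a
        alpha = ss / (q @ q)                       # step b
        r = r - alpha * q                          # step c
        s = c_inv(Y.T @ r)                         # step d
        beta = (s @ s) / ss                        # step e
        p = s + beta * p                           # step f
        d = d + alpha * direction                  # step g
        if np.linalg.norm(alpha * direction) < eps:
            break
    return d

# Hypothetical test problem, compared against NumPy's least-squares solver.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))
z = rng.normal(size=20)
d_cg = cgpcne(Y, z)
d_ref, *_ = np.linalg.lstsq(Y, z, rcond=None)
print(np.max(np.abs(d_cg - d_ref)))
```

The stopping test uses the norm of the last update, which matches the until-clause of step 2 because consecutive iterates differ exactly by that update.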


References 


[CF97] G. Castellano and A.M. Fanelli, An iterative pruning algorithm for feed-forward neural networks, IEEE Transactions on Neural Networks, 8(3), 519-531, May 1997.

[CGGMM95] M. Cottrell, B. Girard, Y. Girard, M. Mangeas, and C. Muller, Neural modelling for time series: Statistical stepwise method for weight elimination, IEEE Transactions on Neural Networks, 6(6), 1355-1363, November 1995.

[E01] A.P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information, IEEE Transactions on Neural Networks, 12(6), 1386-1399, November 2001.

[FAN01] F. Fnaiech, S. Abid, and M. Najim, A fast feed-forward training algorithm using a modified form of standard back-propagation algorithm, IEEE Transactions on Neural Networks, 12(2), 424-430, March 2001.

[FF02] N. Fnaiech, F. Fnaiech, and M. Cheriet, A new feed-forward neural network pruning algorithm: Iterative-SSM pruning, IEEE International Conference on Systems, Man and Cybernetics, Hammamet, Tunisia, October 6-9, 2002.

[FFJC09] N. Fnaiech, F. Fnaiech, B.W. Jervis, and M. Cheriet, The combined statistical stepwise and iterative neural network pruning algorithm, Intelligent Automation and Soft Computing, 15(4), 573-589, 2009.

[FFN01] F. Fnaiech, N. Fnaiech, and M. Najim, A new feed-forward neural network hidden layer's neurons pruning algorithm, IEEE International Conference on Acoustic Speech and Signal Processing (ICASSP'2001), Salt Lake City, UT.

[FSN98] F. Fnaiech, M. Sayadi, and M. Najim, Texture characterization based on two dimensional lattice coefficients, IEEE ICASSP'98, Seattle, WA and IEEE ICASSP'99, Phoenix, AZ.

[J00] J.-G. Juang, Trajectory synthesis based on different fuzzy modeling network pruning algorithms, Proceeding of the 2000 IEEE International Conference on Control Applications, Anchorage, AK, 2000.

[LDS90] Y. LeCun, J.S. Denker, and S.A. Solla, Optimal brain damage. In D. Touretzky, ed. Advances in Neural Information Processing Systems, Vol. 2, pp. 598-605. Morgan Kaufmann, Palo Alto, CA, 1990.

[M93] J.E. Moody, Prediction risk and architecture selection for neural networks, From Statistics to Neural Networks: Theory and Pattern Recognition Applications. Springer-Verlag, Berlin, Germany, 1993.

[MHF06] S. Mei, Z. Huang, and K. Fang, A neural network controller based on genetic algorithms, International Conference on Intelligent Processing Systems, Beijing, China, October 28-31, 1997.

[P97] L. Prechelt, Connection pruning with static and adaptive pruning schedules, Neuro-Comput., 16(1), 49-61, 1997.

[SJF96] M. Sayadi, B.W. Jervis, and F. Fnaiech, Classification of brain conditions using multilayer perceptrons trained by the recursive least squares algorithm, Proceeding of the 2nd International Conference on Neural Network and Expert System in Medicine and Healthcare, Plymouth, U.K., pp. 5-13, 28-30, August 1996.

[W09] B.M. Wilamowski, Neural network architectures and learning algorithms, IEEE Ind. Electron. Mag., 3(4), 56-63, November 2009.

[W10] B.M. Wilamowski, Challenges in applications of computational intelligence in industrial electronics, ISIE10 - International Symposium on Industrial Electronics, Bari, Italy, July 4-7, 2010, pp. 15-22.

[WCSL01] K.-W. Wong, S.-J. Chang, J. Sun, and C.S. Leung, Adaptive training and pruning in feed-forward networks, Electronics Letters, 37(2), 106-107, 2001.

[WHM03] B.M. Wilamowski, D. Hunter, and A. Malinowski, Solving parity-N problems with feedforward 

neural network, Proceedings of the IICNN’03 International Joint Conference on Neural Networks, 

pp. 2546-2551, Portland, OR, July 20-23, 2003. 

[WY10] B.M. Wilamowski and H. Yu, Improved Computation for Levenberg Marquardt Training, IEEE 
Trans. Neural Netw. 21(6), 930-937, June 2010. 


© 2011 by Taylor and Francis Group, LLC 


16 


Principal Component Analysis


Anastasios Tefas
Aristotle University of Thessaloniki

Ioannis Pitas
Aristotle University of Thessaloniki


16.1 Introduction
16.2 Principal Component Analysis Algorithm
16.3 Computational Complexity and High-Dimensional Data
16.4 Singular Value Decomposition
16.5 Kernel Principal Component Analysis
16.6 PCA Neural Networks
16.7 Applications of PCA
16.8 Conclusions
References


16.1 Introduction 


Principal component analysis (PCA) is a classical statistical data analysis technique that is widely used 
in many real-life applications for dimensionality reduction, data compression, data visualization, and 
more usually for feature extraction [6]. In this chapter, the theory of PCA is explained in detail and 
practical implementation issues are presented along with various application examples. The mathemati- 
cal concepts behind PCA such as mean value, covariance, eigenvalues, and eigenvectors are also briefly 
introduced. 

The history of PCA starts in 1901 with the work of Pearson [9], who proposed a linear regression
method in N dimensions using least mean squares (LMS). However, Hotelling is considered to be the
founder of PCA [4], since he was the first to propose PCA for analyzing the variance of
multidimensional random variables. PCA is equivalent to the Karhunen–Loève transform used in signal
processing [2].

The principal idea behind PCA is that, in many systems that are described by many, let us assume 
N, random variables, the degrees of freedom M are less than N. Thus, although the dimensionality of 
the observation vectors is N, the system can be described by M < N uncorrelated but hidden random 
variables. These hidden random variables are usually called factors or features of the observation vec- 
tors. In the following, the term sample vectors will refer to the observation vectors. In statistics, the term 
superficial dimensionality refers to the dimensionality N of the sample vector, whereas the term intrinsic 
dimensionality refers to the dimensionality M of the feature vectors. Obviously, if the N dimensions of 
the sample vectors are uncorrelated, then the intrinsic dimensionality is also M = N. As the correlation
between the random variables increases, fewer features are needed in order to represent the sample vectors
and thus M < N. In the limit, only one feature (M = 1) is enough for representing the sample vectors.

In the following, we will use lowercase bold roman letters to represent column vectors (e.g., $\mathbf{x} \in \mathbb{R}^N$)
and uppercase bold roman letters to represent matrices (e.g., $\mathbf{W} \in \mathbb{R}^{N \times M}$). The transpose of a vector or a
matrix is denoted using the superscript T, so that $\mathbf{x}^T$ will be a row vector. The notation $(x_1, x_2, \ldots, x_N)$ is
used for representing a row vector and $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$ is used for representing a column vector.


The PCA algorithm can be defined as the orthogonal projection of the sample vectors onto a
lower-dimensional linear subspace such that the variance of the projected sample vectors (i.e., the
feature vectors) is maximized in this subspace [4]. This definition is equivalent to the orthogonal
projection, such that the mean squared distance between the sample vectors and their projections in the
lower-dimensional space is minimized [9].

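Let us begin the detailed PCA description by considering that the sample vectors are represented by a
random vector $\mathbf{x}_i = (x_1, x_2, \ldots, x_N)^T$, $i = 1 \ldots K$, having expected value $E\{\mathbf{x}\} = \mathbf{0}$ and autocorrelation matrix
$\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\}$. According to the PCA transform, the feature vector $\hat{\mathbf{x}}_i \in \mathbb{R}^M$ is a linear transformation of
the sample vector $\mathbf{x}_i$:

$$\hat{\mathbf{x}}_i = \mathbf{W}^T\mathbf{x}_i \tag{16.1}$$

where $\mathbf{W} \in \mathbb{R}^{N \times M}$ is a matrix having fewer columns than rows.

We can consider that if we want to reduce the dimension of the sample vectors from N to 1, we should
find an appropriate vector $\mathbf{w} \in \mathbb{R}^N$, such that for each sample vector $\mathbf{x}_i \in \mathbb{R}^N$, the projected scalar value
will be given by the inner product $\hat{x} = \mathbf{w}^T\mathbf{x}$. If we want to reduce the dimension of the sample vectors from
N to M, then we should find M projection vectors $\mathbf{w}_i \in \mathbb{R}^N$, $i = 1 \ldots M$. The projected vector $\hat{\mathbf{x}} \in \mathbb{R}^M$ will
be given by $\hat{\mathbf{x}} = \mathbf{W}^T\mathbf{x}$, where the matrix $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M)$.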


16.2 Principal Component Analysis Algorithm 


The basic idea in PCA is to reduce the dimensionality of a set of multidimensional data using a linear
transformation and retain as much data variation as possible. This is achieved by transforming the
initial set of N random variables to a new set of M random variables that are called principal components
using projection vectors. The principal components are ordered in such a way that they retain
information (variance) in descending order. That is, the projection of the sample vectors to
$\mathbf{w}_1$ should maximize the variance of the projected samples.

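Let us denote by $\mathbf{m}_x$ the mean vector of the sample data:

$$\mathbf{m}_x = \frac{1}{K}\sum_{i=1}^{K}\mathbf{x}_i, \tag{16.2}$$

where K is the number of sample vectors in the data set. The mean value of the data after the projection
to the vector $\mathbf{w}_1$ is given by

$$\hat{m}_x = \frac{1}{K}\sum_{i=1}^{K}\mathbf{w}_1^T\mathbf{x}_i = \mathbf{w}_1^T\mathbf{m}_x \tag{16.3}$$

and the variance of the projected data to the vector $\mathbf{w}_1$ is given by

$$s_{\hat{x}} = \frac{1}{K}\sum_{i=1}^{K}\left(\mathbf{w}_1^T\mathbf{x}_i - \mathbf{w}_1^T\mathbf{m}_x\right)^2 = \mathbf{w}_1^T\mathbf{S}_x\mathbf{w}_1, \tag{16.4}$$

where $\mathbf{S}_x$ is the covariance matrix of the sample vectors defined by

$$\mathbf{S}_x = \frac{1}{K}\sum_{i=1}^{K}(\mathbf{x}_i - \mathbf{m}_x)(\mathbf{x}_i - \mathbf{m}_x)^T. \tag{16.5}$$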
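The identities in (16.3) and (16.4) can be checked numerically. The following is a minimal sketch on synthetic data; the array names, the mixing matrix, and the random projection direction are assumptions made only for this illustration. Sample vectors are stored as rows here, which is a common NumPy convention rather than the column convention of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 500, 3

# Synthetic correlated samples with a nonzero mean (one sample vector per row).
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
samples = rng.standard_normal((K, N)) @ mixing + np.array([1.0, -2.0, 0.5])

m_x = samples.mean(axis=0)                       # mean vector, (16.2)
S_x = (samples - m_x).T @ (samples - m_x) / K    # covariance matrix, (16.5)

w = rng.standard_normal(N)
w /= np.linalg.norm(w)                           # arbitrary unit-length projection vector

projected = samples @ w                          # scalar projections w^T x_i
mean_direct = projected.mean()
var_direct = np.mean((projected - mean_direct) ** 2)

mean_formula = w @ m_x                           # right-hand side of (16.3)
var_formula = w @ S_x @ w                        # right-hand side of (16.4)
```

Both pairs of values agree up to rounding, for any choice of the unit vector `w`.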


Without loss of generality, we can consider the principal components to be normalized vectors (i.e.,
$\mathbf{w}_i^T\mathbf{w}_i = 1$), since we need to find only the direction of the principal component. In order to find the first
principal component, we seek the vector $\mathbf{w}_1$ that maximizes the projected data variance $s_{\hat{x}}$ in (16.4)
and has unit magnitude:

$$\text{maximize } \mathbf{w}_1^T\mathbf{S}_x\mathbf{w}_1 \tag{16.6}$$

$$\text{subject to } \mathbf{w}_1^T\mathbf{w}_1 = 1. \tag{16.7}$$

The above optimization problem can be solved using Lagrange multipliers [3]:

$$J = \mathbf{w}_1^T\mathbf{S}_x\mathbf{w}_1 + \lambda_1\left(1 - \mathbf{w}_1^T\mathbf{w}_1\right). \tag{16.8}$$

By setting the derivative of the Lagrangian with respect to the projection vector equal to zero, we
obtain

$$\frac{\partial J}{\partial \mathbf{w}_1} = \mathbf{0} \Rightarrow \mathbf{S}_x\mathbf{w}_1 = \lambda_1\mathbf{w}_1, \tag{16.9}$$

which implies that the variance $s_{\hat{x}}$ is maximized if the sample data are projected using an eigenvector
$\mathbf{w}_1$ of their covariance matrix $\mathbf{S}_x$. The corresponding Lagrange multiplier is the eigenvalue that
corresponds to the eigenvector $\mathbf{w}_1$. Moreover, if we multiply both sides of (16.9) from the left by the
vector $\mathbf{w}_1^T$, we get

$$\mathbf{w}_1^T\mathbf{S}_x\mathbf{w}_1 = \lambda_1\mathbf{w}_1^T\mathbf{w}_1 = \lambda_1. \tag{16.10}$$


That is, the solution to the PCA problem is to perform eigenanalysis of the covariance matrix of
the sample data. All the eigenvectors and their corresponding eigenvalues satisfy the stationarity
condition (16.9). The value of the corresponding eigenvalue is equal to the resulting variance
after the projection and thus, if we want to maximize the variance of the projected samples, we
should choose as projection vector the eigenvector that corresponds to the largest eigenvalue. In
the literature, either the projection values or the eigenvectors are called principal components [6].
An example of PCA on a two-dimensional data set is illustrated in Figure 16.1. It is obvious that
the first principal component is the one that maximizes the variance of the projected samples. The
second principal component is orthogonal to the first one.

FIGURE 16.1 PCA example.
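As a small worked illustration (the numbers are chosen here for convenience and are not taken from Figure 16.1), consider the two-dimensional covariance matrix

$$\mathbf{S}_x = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \det(\mathbf{S}_x - \lambda\mathbf{I}) = (2 - \lambda)^2 - 1 = 0 \;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 1.$$

The corresponding unit eigenvectors are

$$\mathbf{w}_1 = \tfrac{1}{\sqrt{2}}(1, 1)^T, \qquad \mathbf{w}_2 = \tfrac{1}{\sqrt{2}}(1, -1)^T, \qquad \mathbf{w}_1^T\mathbf{w}_2 = 0.$$

For an arbitrary unit vector $\mathbf{w} = (\cos\theta, \sin\theta)^T$, the projected variance is

$$\mathbf{w}^T\mathbf{S}_x\mathbf{w} = 2\cos^2\theta + 2\sin\theta\cos\theta + 2\sin^2\theta = 2 + \sin 2\theta \le 3,$$

with equality at $\theta = \pi/4$, that is, exactly in the direction of $\mathbf{w}_1$, where the variance equals the largest eigenvalue.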

If we want to proceed in finding the next principal component, we should seek another projection
vector $\mathbf{w}_2$ that maximizes the variance of the projected samples and is orthogonal to $\mathbf{w}_1$. It is
straightforward to form the corresponding Lagrangian and maximize it with respect to $\mathbf{w}_2$. The projection
vector that maximizes the variance of the projected samples is the eigenvector of the covariance matrix
that corresponds to the second largest eigenvalue.

Similarly, we can extract M eigenvectors of the covariance matrix that will form the M principal
components. These column vectors form the PCA projection matrix $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M)$ that can be used to
reduce the dimensionality of the sample vectors from N to M. Let us also note that if an eigenvalue of the
covariance matrix is zero, then the variance of the projected samples in the corresponding eigenvector
is zero according to (16.10). That is, all the sample vectors are projected to the same scalar value when
using this eigenvector as projection vector. Thus, we can conclude that there is no information (i.e., data
variation) in the directions specified by the null eigenvectors of the covariance matrix. Moreover, we can
discard all these directions that correspond to zero eigenvalues, without any loss of information. The
number of nonzero eigenvalues is equal to the covariance matrix rank. The procedure of discarding the
dimensions that correspond to zero eigenvalues is usually called removal of the null subspace.
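The procedure described so far can be sketched with NumPy as follows. The function name `pca_eig` and the synthetic rank-deficient data set are assumptions made for this illustration; samples are stored as rows. The example also shows a null direction: the three-dimensional samples are built from only two latent variables, so the third eigenvalue is numerically zero.

```python
import numpy as np

def pca_eig(samples, M):
    """PCA by eigenanalysis of the covariance matrix.

    samples: (K, N) array, one sample vector per row.
    Returns the projection matrix W (N, M), all eigenvalues (descending), and the mean.
    """
    m_x = samples.mean(axis=0)
    centered = samples - m_x
    S_x = centered.T @ centered / samples.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S_x)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :M], eigvals, m_x

rng = np.random.default_rng(1)
latent = rng.standard_normal((400, 2)) * np.array([3.0, 1.0])   # two hidden factors
mixing = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
samples = latent @ mixing                        # N = 3 observed, intrinsic dimension 2

W, eigvals, m_x = pca_eig(samples, M=2)
features = (samples - m_x) @ W                   # feature vectors, one per row
feature_var = features.var(axis=0)               # variance of each principal component
```

The variances of the two extracted features equal the two largest eigenvalues, while `eigvals[2]` is zero up to rounding, so discarding that direction loses no information.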

Another interpretation of the PCA algorithm is that it extracts the directions that minimize the mean 
square error between the sample data and the projected sample data. That is, in terms of reconstruction 
error, PCA finds the optimal projections for representing the data (i.e., minimizes the representation 
error). It is straightforward to prove that PCA minimizes the mean square error of the representation [1]. 
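The reconstruction-error interpretation can be illustrated numerically. In the sketch below (synthetic data and variable names are assumptions for the example), the mean squared reconstruction error obtained with the first M eigenvectors equals the sum of the discarded eigenvalues, and a randomly chosen orthonormal basis of the same size does not do better.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, M = 1000, 5, 2
samples = rng.standard_normal((K, N)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

m_x = samples.mean(axis=0)
centered = samples - m_x
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # descending order

def reconstruction_mse(basis):
    """Mean squared distance between samples and their projection onto span(basis)."""
    recon = centered @ basis @ basis.T + m_x
    return np.mean(np.sum((samples - recon) ** 2, axis=1))

mse_pca = reconstruction_mse(eigvecs[:, :M])
discarded = eigvals[M:].sum()                          # variance left out by PCA

# A random M-dimensional orthonormal basis for comparison.
random_basis, _ = np.linalg.qr(rng.standard_normal((N, M)))
mse_random = reconstruction_mse(random_basis)
```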


16.3 Computational Complexity and High-Dimensional Data 


Let us now estimate the computational complexity of applying the PCA algorithm to a data set of K
samples $\mathbf{x}_i$ of dimensionality N. To do so, we should first consider that we need to calculate the mean
vector $\mathbf{m}_x$ and the covariance matrix $\mathbf{S}_x$ of the data set and, afterward, to perform eigenanalysis of the
data covariance matrix. The computational complexity will be given in terms of the number of basic
operations, such as additions, multiplications, and divisions. The complexity of calculating the mean
vector of the data set in (16.2) is $O(KN)$, since we need K − 1 additions for each of the N dimensions.
Similarly, the complexity of calculating the covariance matrix in (16.5) is $O(KN^2)$. Finally, we have to
calculate the eigenvectors of the covariance matrix, which is a procedure with complexity $O(N^3)$ [3]. If we
want to calculate only the first M eigenvectors, then the complexity is reduced to $O(MN^2)$.

It is obvious from the above analysis that, if the data have high-dimensionality, the computational 
complexity of PCA is large. Moreover, the memory needed for storing the N x N covariance matrix 
is also very large. In many real applications, such as image processing, the dimensionality N of the 
sample vectors is very large resulting in demanding implementations of the PCA. For example, if we 
try to use PCA for devising an algorithm for face recognition, even if we use images of relatively small 
size (e.g., 32 x 32), the resulting sample vectors will have 1024 dimensions. Straightforward applica- 
tion of the PCA algorithm requires the eigenanalysis of the 1024 x 1024 covariance matrix. This is a 
computationally intensive task. The problem is even worse, if we consider full-resolution images of 
many megapixels where applying PCA is practically computationally infeasible. 
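As a rough worked illustration of these orders of magnitude, assume (purely for the sake of the example) a data set of K = 100 face images of size 32 × 32, so that N = 1024. Ignoring constant factors, the three stages cost approximately

$$KN = 100 \cdot 1024 = 102{,}400 \quad \text{(mean vector)},$$

$$KN^2 = 100 \cdot 1024^2 = 104{,}857{,}600 \approx 1.05 \times 10^{8} \quad \text{(covariance matrix)},$$

$$N^3 = 1024^3 = 1{,}073{,}741{,}824 \approx 1.07 \times 10^{9} \quad \text{(eigenanalysis)}.$$

The eigenanalysis term dominates, and it grows with the cube of the number of pixels, which is why the cost escalates so quickly for larger images.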

In many applications that use high-dimensional data, however, the number of sample vectors K is
much smaller than their dimensionality ($N \gg K$). In this case, the covariance matrix is not full-rank and
the eigenvectors of the covariance matrix that correspond to nonzero eigenvalues and, thus, constitute
the principal components are at most K − 1. So, there is no sense in calculating more than K − 1 principal
components, since the remaining eigenvectors project the data samples to scalar values with zero variance. As
we have already noted, applying PCA directly to very high-dimensional data is computationally infeasible and,
thus, we should follow a different approach for calculating the M < K principal components.

Let us define by $\mathbf{X} = (\mathbf{x}_1 - \mathbf{m}_x, \mathbf{x}_2 - \mathbf{m}_x, \ldots, \mathbf{x}_K - \mathbf{m}_x)$ the centered data matrix, with dimensions N × K,
that represents the sample vectors. Each column of the matrix $\mathbf{X}$ is a sample vector after subtracting the
data mean vector. The covariance matrix of the sample data is given by

$$\mathbf{S}_x = \frac{1}{K}\mathbf{X}\mathbf{X}^T. \tag{16.11}$$

Let us define by $\hat{\mathbf{S}}_x = \frac{1}{K}\mathbf{X}^T\mathbf{X}$ an auxiliary matrix of dimension K × K. By performing eigenanalysis of
$\hat{\mathbf{S}}_x$, we can calculate the K-dimensional eigenvectors $\hat{\mathbf{w}}_i$ that correspond to the K largest eigenvalues $\lambda_i$,
$i = 1, \ldots, K$ of $\hat{\mathbf{S}}_x$. For these eigenvectors, we have

$$\hat{\mathbf{S}}_x\hat{\mathbf{w}}_i = \lambda_i\hat{\mathbf{w}}_i \Leftrightarrow \frac{1}{K}\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}}_i = \lambda_i\hat{\mathbf{w}}_i. \tag{16.12}$$

Multiplying both sides of (16.12) from the left by $\mathbf{X}$, we get

$$\frac{1}{K}\mathbf{X}\mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}}_i) = \lambda_i(\mathbf{X}\hat{\mathbf{w}}_i) \Leftrightarrow \mathbf{S}_x\mathbf{u}_i = \lambda_i\mathbf{u}_i, \tag{16.13}$$

with $\mathbf{u}_i = \mathbf{X}\hat{\mathbf{w}}_i$. That is, the vectors $\mathbf{u}_i$ of dimension N are eigenvectors of the covariance matrix $\mathbf{S}_x$ of the
initial sample vectors. Following this procedure, we can calculate efficiently the K − 1 eigenvectors of
the covariance matrix that correspond to nonzero eigenvalues. The remaining N − K + 1 eigenvectors
of the covariance matrix correspond to zero eigenvalues and are not important. We should also note
that the vectors $\mathbf{u}_i$ are not normalized and, thus, should be normalized in order to have unit length.
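The K × K shortcut can be sketched as follows, assuming synthetic data with N = 200 and K = 10 (the sizes and names are illustrative only). Here the sample vectors are stored as columns, matching the definition of the centered data matrix in the text. The sketch checks that the mapped and normalized vectors are indeed eigenvectors of the full N × N covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 200, 10
samples = rng.standard_normal((N, K))                 # one sample vector per column

X = samples - samples.mean(axis=1, keepdims=True)     # centered data matrix, N x K
S_aux = X.T @ X / K                                   # auxiliary K x K matrix

lam, W_hat = np.linalg.eigh(S_aux)
lam, W_hat = lam[::-1], W_hat[:, ::-1]                # descending eigenvalues

keep = K - 1                                          # centering leaves at most K-1 nonzero eigenvalues
U = X @ W_hat[:, :keep]                               # u_i = X w_i, N-dimensional
U /= np.linalg.norm(U, axis=0)                        # normalize to unit length

# Verification only: build the large covariance matrix and test S_x u_i = lambda_i u_i.
S_x = X @ X.T / K
residual = np.abs(S_x @ U - U * lam[:keep]).max()
```

Only the small matrix `S_aux` needs to be decomposed; the large matrix `S_x` is formed here solely to verify the result.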


16.4 Singular Value Decomposition 


In many cases, PCA is implemented using the singular value decomposition (SVD) of the covariance
matrix. SVD is an important tool for factorizing an arbitrary real or complex matrix, with many
applications in various research areas, such as signal processing and statistics. SVD is widely used for
computing the pseudoinverse of a matrix, for least-squares data fitting, for matrix approximation, and for
rank and null space calculation. SVD is closely related to PCA, since it gives a general solution to matrix
decomposition and, in many cases, SVD is more stable numerically than eigenanalysis of the covariance
matrix. Let $\mathbf{X}$ be an arbitrary N × M matrix and $\mathbf{C} = \mathbf{X}^T\mathbf{X}$ be a rank R, square, symmetric M × M matrix.
The objective of SVD is to find a decomposition of the matrix $\mathbf{X}$ into three matrices $\mathbf{U}$, $\mathbf{S}$, $\mathbf{V}$ of
dimensions N × M, M × M, and M × M, respectively, such that

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T, \tag{16.14}$$

where $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}$, and $\mathbf{S}$ is a diagonal matrix. That is, the matrix $\mathbf{X}$ can be expressed as the product
of a matrix with orthonormal columns, a diagonal matrix, and an orthogonal matrix.

There is a direct relation between PCA and SVD when principal components are calculated using the
covariance matrix in (16.5). If we consider data samples that have zero mean value (e.g., by centering),
then PCA can be implemented using SVD. To do so, we perform SVD on the data matrix $\mathbf{X}$ used for the
definition of the covariance matrix in (16.11). Thus, we find the matrices $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$, such that $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$.
According to (16.11), the covariance matrix is given by

$$\mathbf{S}_x = \frac{1}{K}\mathbf{X}\mathbf{X}^T = \frac{1}{K}\mathbf{U}\mathbf{S}^2\mathbf{U}^T. \tag{16.15}$$

In this case, $\mathbf{U}$ is an N × M matrix (with M = K columns for the data matrix of (16.11)). The eigenvectors
of the covariance matrix that correspond to nonzero eigenvalues are stored in the first N columns of $\mathbf{U}$
if N < M, or in the first M columns if M < N. The corresponding eigenvalues are stored in $\mathbf{S}^2$, scaled by
1/K as in (16.15). That is, diagonalization of the covariance matrix using SVD yields the principal
components. In practice, PCA is considered as a special case of SVD and in many cases SVD has better
numerical stability.
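The relation between the two approaches can be checked with a short NumPy sketch. The data are synthetic and the variable names are assumptions for the example; samples are stored as columns, as in the centered data matrix of the text. Note that `numpy.linalg.svd` returns the singular values in descending order and the matrix $\mathbf{V}^T$ rather than $\mathbf{V}$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 6, 50
samples = rng.standard_normal((N, K)) * np.arange(1, N + 1)[:, None]

X = samples - samples.mean(axis=1, keepdims=True)     # centered N x K data matrix

# Route 1: SVD of the data matrix itself (the covariance matrix is never formed).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = s ** 2 / K                             # eigenvalues of S_x

# Route 2: eigenanalysis of the covariance matrix.
S_x = X @ X.T / K
eigvals = np.linalg.eigvalsh(S_x)[::-1]               # descending order

# The columns of U are eigenvectors of S_x.
residual = np.abs(S_x @ U - U * eig_from_svd).max()
```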


16.5 Kernel Principal Component Analysis 


The standard PCA algorithm can be extended to support nonlinear principal components using nonlin- 
ear kernels [10]. The idea is to substitute the inner products in the space of the sample data with kernels 
that transform the scalar products to a higher, even infinite, dimensional space. PCA is then applied to 
this higher dimensional space resulting in nonlinear principal components. This method is called kernel 
principal component analysis (KPCA). 

Let us assume that we have subtracted the mean value and the modified sample vectors $\mathbf{x}_i$ have zero
mean. We can consider nonlinear transformations $\phi(\mathbf{x}_i)$ that map the sample vectors to a high-dimensional
space. Let us also assume that the transformed data also have zero mean, that is, $\sum_{i=1}^{K}\phi(\mathbf{x}_i) = \mathbf{0}$. The
covariance matrix in the nonlinear space is given by

$$\mathbf{S}_x^{\phi} = \frac{1}{K}\sum_{i=1}^{K}\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T. \tag{16.16}$$


The next step is to perform eigenanalysis of $\mathbf{S}_x^{\phi}$. However, its dimension is very high (even infinite), so
direct eigenanalysis is infeasible. Thus, a procedure similar to the one described for high-dimensional data in
Section 16.3 can be followed [10], in order to perform eigenanalysis of the auxiliary matrix of dimension
K × K, as described in (16.12), considering nonlinear kernels.

That is, in order to perform KPCA, the auxiliary kernel matrix $\mathbf{K}_x$ should be computed as follows:

$$[\mathbf{K}_x]_{i,j} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = \mathcal{K}(\mathbf{x}_i, \mathbf{x}_j), \tag{16.17}$$

where $\mathcal{K}$ is an appropriate nonlinear kernel satisfying Mercer’s conditions [10].

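The matrix $\mathbf{K}_x$ is positive semidefinite of dimension K × K, where K is the number of data samples.
Afterward, eigenanalysis is performed on the matrix $\mathbf{K}_x$ in order to calculate its eigenvectors. Once again,
the null space is discarded by eliminating the eigenvectors that correspond to zero eigenvalues. The
remaining vectors are ordered according to their eigenvalues and they are normalized such that $\mathbf{v}_i^T\mathbf{v}_i = 1$
for all i that correspond to nonzero eigenvalues. The normalized vectors $\mathbf{v}_i$ form the projection matrix
$\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p)$, where p is the number of retained eigenvectors.

The nonlinear principal components $\hat{\mathbf{x}}$ of a test sample $\mathbf{x}$ are calculated as follows:

$$[\hat{\mathbf{x}}]_j = \mathbf{v}_j^T\phi(\mathbf{x}) = \sum_{i=1}^{K}[\mathbf{v}_j]_i\,\phi(\mathbf{x}_i)^T\phi(\mathbf{x}) = \sum_{i=1}^{K}[\mathbf{v}_j]_i\,\mathcal{K}(\mathbf{x}_i, \mathbf{x}), \tag{16.18}$$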


where $[\cdot]_i$ denotes the ith element of the corresponding vector or matrix. The kernels $\mathcal{K}$ that are most
commonly used in the literature are the Gaussian $\mathcal{K}(\mathbf{x}_i, \mathbf{x}_j) = \exp(-p\|\mathbf{x}_i - \mathbf{x}_j\|^2)$ and the polynomial
$\mathcal{K}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + 1)^p$ ones [13]. We should also note that an appropriate centering [10] should be
used in the general case, since the data samples do not have zero mean in the high-dimensional space.
Another remark is that KPCA applies eigenanalysis to square matrices having dimensions equal to the
number of samples and, thus, it may be computationally intensive if there are many sample vectors in
the data set.
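A minimal KPCA sketch in NumPy is given below. Several choices in it are assumptions made for illustration rather than details taken from the text: the function names, the kernel parameter value, the standard double-centering formula for the kernel matrix, and the scaling of each eigenvector by $1/\sqrt{\lambda}$ (a common convention that gives the implied feature-space axes unit length). Only the training samples are projected. As a sanity check, KPCA with a linear kernel should reproduce ordinary PCA features up to the sign of each component.

```python
import numpy as np

def gaussian_kernel(A, B, p=0.5):
    """K(a, b) = exp(-p * ||a - b||^2) for every pair of rows of A and B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-p * sq_dist)

def linear_kernel(A, B):
    """Plain inner products; with this kernel KPCA reduces to standard PCA."""
    return A @ B.T

def kernel_pca(samples, kernel, M):
    """samples: (K, N), one sample per row. Returns the (K, M) nonlinear features."""
    K = samples.shape[0]
    K_x = kernel(samples, samples)                    # kernel matrix, as in (16.17)
    ones = np.full((K, K), 1.0 / K)
    K_c = K_x - ones @ K_x - K_x @ ones + ones @ K_x @ ones   # centering in feature space
    lam, V = np.linalg.eigh(K_c)
    lam, V = lam[::-1][:M], V[:, ::-1][:, :M]         # M largest eigenvalues and eigenvectors
    alphas = V / np.sqrt(lam)                         # assumed normalization convention
    return K_c @ alphas                               # projections of the training samples

rng = np.random.default_rng(0)
samples = rng.standard_normal((40, 3)) * np.array([3.0, 1.5, 0.5])

features_rbf = kernel_pca(samples, gaussian_kernel, M=2)
features_lin = kernel_pca(samples, linear_kernel, M=2)

# Ordinary PCA features for comparison with the linear-kernel result.
centered = samples - samples.mean(axis=0)
_, eigvecs = np.linalg.eigh(centered.T @ centered / len(samples))
features_pca = centered @ eigvecs[:, ::-1][:, :2]
```

The eigenanalysis here is performed on a 40 × 40 matrix regardless of the kernel, which illustrates why the cost of KPCA is governed by the number of samples rather than by the dimension of the feature space.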

16.6 PCA Neural Networks


PCA can also be performed without eigenanalysis. This can be achieved by using neural networks that
can be trained to extract the principal components of the training set [2]. These neural networks are
based on Hebbian learning and have a single output layer. The initial algorithm for training these
networks was proposed in [7]. The PCA neural networks are unsupervised. Compared to the standard
approach of diagonalizing the covariance matrix, they have an advantage when the data are
nonstationary. Furthermore, they can be implemented incrementally. Another advantage of the PCA neural
networks is that they can be used to construct nonlinear variants of PCA by adding nonlinear activation
functions. However, their major disadvantages are slow convergence and numerical instability.
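One well-known Hebbian-type rule of this kind is Oja's rule for a single linear neuron, sketched below. Whether it coincides exactly with the algorithm of [7] is not stated in the text, so it is used here only as an assumed, representative example; the learning rate, the synthetic data, and the variable names are likewise illustrative. The weight vector is updated one sample at a time and no covariance matrix is ever formed during training.

```python
import numpy as np

rng = np.random.default_rng(5)

# Zero-mean 2-D data with principal axes rotated by 30 degrees (variances 4 and 0.25).
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
samples = (rng.standard_normal((5000, 2)) * np.array([2.0, 0.5])) @ rotation.T

w = rng.standard_normal(2)
w /= np.linalg.norm(w)           # random initial weight vector
eta = 0.005                      # assumed small, constant learning rate

for x in samples:                # one incremental update per incoming sample
    y = w @ x                    # neuron output
    w += eta * y * (x - y * w)   # Hebbian term y*x with Oja's decay term y^2 * w

# Reference: leading eigenvector of the sample covariance matrix (for checking only).
S_x = samples.T @ samples / len(samples)
top_eigvec = np.linalg.eigh(S_x)[1][:, -1]
cosine = abs(w @ top_eigvec) / np.linalg.norm(w)
```

After one pass over the data, `cosine` is close to 1, meaning that the weight vector has aligned with the first principal component (up to sign). The small residual fluctuation caused by the constant learning rate is one face of the convergence issues mentioned above.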

Another way to perform PCA using neural networks is by constructing a multilayer perceptron having
a hidden layer with M neurons, where M is smaller than the dimension N of the input data. The neural
network is trained in order to reproduce the input vector at the output layer. Thus, the input vector is reduced
in dimension in the hidden layer and it is reconstructed in the output layer. In this case, the neurons in the
hidden layer perform PCA. Nonlinear extensions of this neural network can be considered for
performing nonlinear PCA. The optimization problem solved by this network is hard and convergence is slow [2].


16.7 Applications of PCA 


PCA has been successfully used in many applications such as dimensionality reduction and feature 
extraction for pattern recognition and data mining, lossy data compression, and data visualization [1,2]. 
Pattern representation is very critical in pattern recognition applications and PCA is a good solution for 
preprocessing the data prior to classification. 

In face recognition, which is one of the most difficult classification tasks, PCA has been used to
develop the EigenFaces algorithm [12], which is considered the baseline algorithm for comparison.
Features with good approximation quality, such as the ones produced by PCA, however, are not always
good discriminative features. Thus, they usually cannot be used directly in classification tasks when
class-dependent information is available. A solution to this problem is to use PCA as a preprocessing step
only. Discriminant analysis is applied in a second step, in order to compute linear projections that are
useful for extracting discriminant features [8,11]. An example of PCA versus linear discriminant
analysis for classification is shown in Figure 16.2. It is obvious that, although PCA finds the projection that
best represents the entire data set, which comprises the samples of both classes, it fails to separate
the samples of the two classes after the projection. Instead, linear discriminant analysis can find the
projection that best separates the samples of the two classes.

FIGURE 16.2 PCA versus linear discriminant analysis in classification problems.
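The effect described above can be reproduced with a small synthetic experiment. The data below are an assumed configuration chosen to mimic the situation (two elongated classes whose means differ along the low-variance direction), and the two-class Fisher criterion is used as one standard form of linear discriminant analysis; the helper `accuracy` with its midpoint threshold is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
spread = np.array([3.0, 0.3])                            # large variance along the first axis
class_a = rng.standard_normal((n, 2)) * spread + np.array([0.0, 1.0])
class_b = rng.standard_normal((n, 2)) * spread + np.array([0.0, -1.0])
data = np.vstack([class_a, class_b])
labels = np.r_[np.zeros(n), np.ones(n)]                  # 0 = class a, 1 = class b

# PCA direction: leading eigenvector of the covariance of the pooled data.
centered = data - data.mean(axis=0)
w_pca = np.linalg.eigh(centered.T @ centered / len(data))[1][:, -1]

# Fisher discriminant direction: S_w^{-1} (m_a - m_b).
m_a, m_b = class_a.mean(axis=0), class_b.mean(axis=0)
S_w = (class_a - m_a).T @ (class_a - m_a) + (class_b - m_b).T @ (class_b - m_b)
w_lda = np.linalg.solve(S_w, m_a - m_b)
w_lda /= np.linalg.norm(w_lda)

def accuracy(w):
    """Classify 1-D projections by thresholding midway between the projected class means."""
    projected = data @ w
    threshold = 0.5 * (m_a @ w + m_b @ w)
    predicted = (projected < threshold).astype(float)
    acc = np.mean(predicted == labels)
    return max(acc, 1.0 - acc)                           # the sign of w is arbitrary

acc_pca = accuracy(w_pca)
acc_lda = accuracy(w_lda)
```

In this configuration the PCA direction follows the long axis shared by both classes, so the projected classes overlap almost completely, whereas the discriminant direction separates them nearly perfectly.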

Kernel PCA provides a drop-in replacement for PCA that, besides second-order correlations, also takes
into account higher-order correlations. KPCA has been successfully used as a preprocessing step
in the KPCA plus linear discriminant analysis algorithm, which has proven very efficient for many
demanding classification tasks [5]. KPCA has also been used as a preprocessing step to support vector
machines in order to give a powerful classification algorithm [14].

PCA can also be used for lossy data compression. Compression is achieved by using a small number
of principal components instead of the full-dimensional data. For example, data samples that lie in $\mathbb{R}^N$
can be compressed by representing them in $\mathbb{R}^M$, with $M \ll N$, using the M principal components of each
data sample. An example is given in Figure 16.3, where the compression result of an image using PCA
is illustrated. The original image is split into subimages of size 16 × 16 and then PCA is applied in order
to extract the most representative information. From the 256 dimensions only 16 are retained and the
reconstructed image is shown in Figure 16.3b. The mean-squared error of the reconstruction is plotted
in Figure 16.3c, where it can be observed that as the number of the retained eigenvectors increases, the
mean-squared error of the image reconstruction decreases rapidly. The compression ratio can be
controlled by the number of principal components that are retained.

FIGURE 16.3 Image compression using PCA. The original image in (a) is reconstructed in (b) using only 16
eigenvectors from the 256 of the original image. The mean-squared error is plotted in (c).
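The block-based compression scheme can be sketched as follows. The synthetic test image, the function names, and the set of retained-component counts are assumptions for illustration; the actual image of Figure 16.3 is not available here. Each 16 × 16 subimage is flattened into a 256-dimensional sample vector, PCA is computed over all subimages, and each subimage is then stored as M coefficients.

```python
import numpy as np

def to_blocks(image, b):
    """Split an (h, w) image into non-overlapping b x b blocks, one flattened block per row."""
    h, w = image.shape
    return image.reshape(h // b, b, w // b, b).swapaxes(1, 2).reshape(-1, b * b)

def from_blocks(blocks, shape, b):
    """Inverse of to_blocks."""
    h, w = shape
    return blocks.reshape(h // b, w // b, b, b).swapaxes(1, 2).reshape(h, w)

def pca_compress(image, b=16, M=16):
    """Reconstruct the image from only M principal components per block."""
    blocks = to_blocks(image, b)
    mean = blocks.mean(axis=0)
    centered = blocks - mean
    _, eigvecs = np.linalg.eigh(centered.T @ centered / len(blocks))
    W = eigvecs[:, ::-1][:, :M]                  # M leading eigenvectors
    codes = centered @ W                         # compressed representation: M numbers per block
    return from_blocks(codes @ W.T + mean, image.shape, b)

# Synthetic 256 x 256 test image: smooth structure plus a little noise.
rng = np.random.default_rng(7)
yy, xx = np.mgrid[0:256, 0:256]
image = np.sin(xx / 9.0) + np.cos(yy / 13.0) + 0.1 * rng.standard_normal((256, 256))

mse = {M: np.mean((image - pca_compress(image, M=M)) ** 2) for M in (4, 16, 64, 256)}
```

The reconstruction error cannot increase as more eigenvectors are retained, and it vanishes (up to rounding) when all 256 are kept. In a real codec, the M eigenvectors and the mean block would also have to be stored or transmitted, which this sketch ignores.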


16.8 Conclusions 


PCA is a very powerful tool for the statistical analysis of a data set. It provides low-dimensional
representations of the sample data, while retaining as much data variation as possible. It is extensively used in
many applications for dimensionality reduction, feature extraction, lossy data compression, and
visualization. Many variants of PCA have been proposed for efficient and stable extraction of the principal
components, extension to nonlinear components, and combination with powerful data classification
algorithms.


17.1 Introduction


This chapter introduces a family of neural network (NN) control architectures known as adaptive critic 
controllers as a natural extension of simpler architectures. The merits of each architecture are discussed 
and their shortcomings exposed, which in turn becomes the motivation for the next. The first architec- 
ture is an application of a single NN with a classical training algorithm, which implies the requirement 
of full knowledge of the plant’s dynamics at all times. The controller is then improved by the addition 
of a second NN capable of generating online a map of the plant’s dynamics; however, the training algo- 
rithm remains fundamentally the same. 

The addition of a third NN and a change in the training paradigm leads to the adaptive critic archi- 
tecture known as heuristic dynamic programming (HDP), followed by dual heuristic programming 
(DHP). Finally, the developments culminate in a full description of the most advanced adaptive critic 
architecture so far, known as globalized dual heuristic programming (GDHP). Presented in great 
detail, the particular GDHP training algorithm contained in this chapter was developed for application 
in the demanding field of fault tolerant control (FTC) [1]. In order to better comprehend the needs that
drive many researchers to seek the great adaptive potential of GDHP, and also to give perspective
to some examples presented at the end of this chapter, a short introduction to FTC is provided for
the readers.


The ultimate goal of this chapter is to provide the readers with basic motivation, background, and
description of different adaptive critic controllers by presenting a series of NN adaptive controller
architectures ranging from a single NN adaptive controller to GDHP.


17.2 Background 


Increased performance requirements are often achieved at the cost of plant and control simplicity. As
overall complexity rises, so does the chance of occurrence, diversity, and severity of faults. Therefore,
availability, defined as the probability that a system or equipment will operate satisfactorily and effectively
at any point of time [2], becomes a factor of great importance. For automated production processes, for
example, availability is now considered to be the single factor with highest impact on profitability [3].

FTC is a field of research that aims to increase availability and reduce the risk of safety hazards by
specifically designing control algorithms capable of maintaining stability and/or performance despite
the occurrence of faults [4]. As complex systems suffer from faults, the original model parameters or
even the dynamic structure itself may change in a multitude of unpredictable ways. Even if the system
has a satisfactory linearization around the nominal operation point, nonlinearities may become of
paramount importance after a fault occurs [5]. When the stochastic nature of faults is taken into
consideration, so that even predicting all fault scenarios becomes impossible, it becomes clear that the
problem of interest of FTC cannot be dealt with without an online nonlinear adaptive control strategy.
Successful applications of adaptive critic architecture controllers to FTC problems [6] have been credited
to the controllers’ great flexibility and known effectiveness in noisy, nonlinear environments while
making minimal assumptions regarding the nature of that environment [7].

It is important to state here that, for the benefit of the discussion in this chapter, the required redun- 
dancy is assumed to exist in the system. Hardware redundancy requires two or more independent 
instruments that perform the same function, while analytical redundancy uses two components based 
on different principles to measure a variable, where at least one of them uses a mathematical model in 
analytical form. In either case, from the theoretical point of view, this assumption matches the require- 
ment for sustained observability and controllability (or global reachability for nonlinear systems) 
through fault scenarios. 


17.3 Single NN Control Architecture 


The goal of this approach is to use a NN to generate a nonlinear map connecting the states of the plant
$\mathbf{x}(t)$, previous inputs $\mathbf{u}(t - 1)$, and current target $\mathbf{x}^t(t)$ to an input $\mathbf{u}(t)$ that will minimize the utility
function $U(t)$ defined by Equation 17.1:

$$U(t) = \left(\mathbf{x}(t) - \mathbf{x}^t(t)\right)^T\mathbf{Q}\left(\mathbf{x}(t) - \mathbf{x}^t(t)\right) + \rho\,\mathbf{u}(t)^T\mathbf{R}\,\mathbf{u}(t) \tag{17.1}$$

where
$\mathbf{Q}$ is a diagonal square matrix that can be used to assign different degrees of importance to each state
$\mathbf{R}$ is the equivalent matrix that penalizes the amount of control action used
$\rho$ is a scalar used to balance the minimization of the tracking error and the energy use during the
process
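A utility function of this quadratic form can be evaluated as in the following sketch. It assumes the reading of Equation 17.1 as a weighted squared tracking error plus a weighted control-effort term; the function name and all numerical values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def utility(x, x_target, u, Q, R, rho):
    """Quadratic utility: tracking error weighted by Q plus control effort weighted by rho * R."""
    error = x - x_target
    return float(error @ Q @ error + rho * (u @ R @ u))

# Hypothetical two-state, single-input example.
x = np.array([1.0, 0.5])          # current plant states
x_target = np.array([0.8, 0.5])   # current target states
u = np.array([0.2])               # current control input
Q = np.diag([1.0, 10.0])          # the second state is weighted ten times more than the first
R = np.array([[1.0]])
rho = 0.1

U_t = utility(x, x_target, u, Q, R, rho)
# Tracking term: 0.2^2 * 1.0 = 0.04; control term: 0.1 * 0.2^2 = 0.004; total 0.044.
```

Raising `rho` shifts the balance toward saving control energy, while larger entries of `Q` make tracking errors in the corresponding states more costly.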


In order to differentiate it from the other NNs that will be introduced in later architectures, this
NN is named action neural network (AcNN). Figure 17.1 depicts such an architecture. When performing
the training of the AcNN, the information of how its weights affect the states of the plant is required.
However, backpropagation through the AcNN only provides information on how the inputs u(t) are
affected by its weights. Therefore, this approach requires the availability of a differential model of the
dynamics of the plant from which the information on how the states x(t) are affected by the inputs
u(t) can be extracted. As a result, this architecture is not suitable for FTC application, since faults are
assumed to modify the dynamics of the plant in unpredictable ways, making it impossible to design
models beforehand.

FIGURE 17.1 Single NN control architecture.


17.4 Adaptive Control Architecture Using Two NNs 


Since it is not possible to design models of the plant dynamics offline for all fault scenarios, this
architecture introduces a second NN with the goal of performing online plant identification. Once this
network has converged to represent a map of the dynamics of the plant, the derivative of the states with
respect to the inputs can be extracted through standard backpropagation. This network will be referred
to as the identification NN (IdNN). Figure 17.2 displays this second approach.

Although no critical restrictions prevent this architecture from being used as a solution to the FTC
problem, its performance can still be largely improved if the training algorithm for the AcNN is
reevaluated. In these first two architectures, the AcNN is trained at each iteration with the goal of
reducing the current value of the utility function U(t). This is performed under the assumption that
this process will ultimately lead to a set of weights that minimize the utility function for all times.
However, this training approach provides no mechanism to minimize the values that U(t) assumes during
training (or the time it takes). Clearly, it is in the interest of FTC to provide a new control solution
to a fault scenario as quickly as possible and with minimum performance impact.


17.5 Heuristic Dynamic Programming 


Seeking to overcome the limitations of the previous approaches, the first adaptive critic controller is
introduced. Adaptive critic architectures have a much greater potential to achieve the required degrees
of reconfiguration and stability because more than the simple instantaneous difference between desired
and actual states is available to be used as a performance index. Due to the continuous interaction
between the controller and the plant, the quality of a certain control strategy can only be fully
measured after analyzing all future effects it has on the control mission, in our case trajectory
tracking.


FIGURE 17.2 Direct adaptive control architecture using two NNs.


FIGURE 17.3 Heuristic dynamic programming.

Therefore, HDP trains the AcNN to minimize not only the present utility function, but also the sum
of all future values of U(t) with a decaying factor γ (0 < γ < 1). Such quantity is referred to as the
cost-to-go J(t), as defined by the Hamilton-Jacobi-Bellman Equation 17.2, and represents the core of
dynamic programming [10].


\[ J(t) = \sum_{k=0}^{\infty} \gamma^{k}\, U(t+k) \tag{17.2} \]
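
The discounted sum can be illustrated with a short Python sketch. It assumes a finite list of future utility values (a truncation of the infinite sum) and computes the cost-to-go both directly and through a backward pass from the final step. The function names are illustrative.

```python
def cost_to_go(utilities, gamma):
    """Truncated cost-to-go: sum of gamma**k * U(t + k) over the given horizon."""
    return sum((gamma ** k) * u for k, u in enumerate(utilities))


def cost_to_go_backward(utilities, gamma):
    """Same quantity, accumulated backward from the final step."""
    j = 0.0
    for u in reversed(utilities):
        j = u + gamma * j
    return j


future_u = [4.0, 2.0, 1.0]
print(cost_to_go(future_u, 0.5))           # 4 + 0.5 * 2 + 0.25 * 1
print(cost_to_go_backward(future_u, 0.5))
```

Both functions return the same value. The backward form needs the whole future sequence, which is why an online method has to work with an estimate of the cost-to-go instead.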


Problems formulated in this form are the main focus of dynamic programming, which solves them through
a backward search from the final step [8]. To make the problem tractable to an online learning approach,
adaptive critic designs require an estimate of the actual cost-to-go to be constantly determined [9,10].
Although ACDs can be implemented with any differentiable structure [11], NNs have been widely used
[12] due to their generalization and nonlinear mapping capabilities as well as the availability of
suitable methods for online learning. Given the complexity of FTC systems, dynamic or recurrent NNs were
chosen due to their more efficient handling of dynamic nonlinear mapping [13]. It is in this context that
we introduce a third NN, named the critic neural network (CrNN), responsible for approximating J(t). The
resulting block diagram is shown in Figure 17.3.

In other words, the training of the AcNN is done in the direction of the minimization of the cost-to-go
approximation. In HDP, this is accomplished by starting the training path of the AcNN with
the information of how the inputs and states will affect the current cost-to-go J(t). Since the CrNN
is trained to estimate it, such information can be easily extracted from the NN via backpropagation
through time [14,15].


17.6 Dual Heuristic Programming 


DHP reevaluates the purpose of the CrNN and redesigns it. Although in HDP the CrNN is trained to
estimate J(t), its true purpose is to provide the AcNN with the partial derivatives of J(t) with respect
to the states and inputs (usually referred to as λ^x(t) and λ^u(t), respectively). In the DHP
architecture, as shown in Figure 17.4, the CrNN is trained to output such derivatives directly. Using
this direct approach, DHP is capable of generating smoother derivatives and has shown improved
performance when compared to HDP. Those results were presented in Ref. [11], where both methods were
applied to a turbogenerator, characterized as a highly complex, nonlinear, fast-acting, multivariable
system with dynamic characteristics that vary as operating conditions change. Also, results from the
application of DHP to the FTC challenge from early stages of the presented work can be found in
Ref. [6]. These benefits come with the tradeoff of a more complex training algorithm for the CrNN, as
shown in Ref. [16].


FIGURE 17.4 Dual heuristic programming.


17.7 Globalized Dual Heuristic Programming
17.7.1 Introduction 


The adaptive critic GDHP algorithm combines the HDP and DHP approaches to generate the most complete
and powerful adaptive critic design [7]. In GDHP, λ^x(t) and λ^u(t) are determined with the precision
and smoothness of DHP, while improving the CrNN training by also estimating J(t) as in HDP [17].
Figure 17.5 depicts the block diagram of this approach.

In this section, the adaptive critic architecture of GDHP is presented in detail. Following this intro- 
duction, the adaptive control problem of interest to FTC is stated mathematically and the adopted nota- 
tion introduced. The next three subsections are focused each on one of the NNs that composes the 
GDHP architecture: identifier, action, and critic. Each NN has its structure presented, followed by a dis- 
cussion on its training algorithm and the ways through which information required by other networks 
is extracted. Finally, all information contained in this section is summarized in the complete GDHP 
algorithm presented in a manner that can be readily applied. 


17.7.2 Preliminaries 


The first step is to define x(t) in Equation 17.3 and u(t) in Equation 17.4, column vectors of the nx
states and nu inputs at time t, and the Tap Delay Line (TDL) vectors x̄(t) in Equation 17.5 and ū(t) in
Equation 17.6 that combine information of TDL_x and TDL_u sampling times, respectively.


\[ x(t) = \begin{bmatrix} x_{1}(t) & x_{2}(t) & \cdots & x_{nx}(t) \end{bmatrix}^{T} \tag{17.3} \]

\[ u(t) = \begin{bmatrix} u_{1}(t) & u_{2}(t) & \cdots & u_{nu}(t) \end{bmatrix}^{T} \tag{17.4} \]

\[ \bar{x}(t) = \begin{bmatrix} x(t)^{T} & x(t-1)^{T} & \cdots & x(t-TDL_{x}+1)^{T} \end{bmatrix}^{T} \tag{17.5} \]

\[ \bar{u}(t) = \begin{bmatrix} u(t)^{T} & u(t-1)^{T} & \cdots & u(t-TDL_{u}+1)^{T} \end{bmatrix}^{T} \tag{17.6} \]


FIGURE 17.5 Globalized dual heuristic programming.
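
The tap delay line vectors can be maintained with a fixed-length buffer. The sketch below is one possible Python arrangement. The class name is illustrative, and initializing the buffer with zeros before enough samples have arrived is an assumption, not something the text specifies.

```python
from collections import deque


class TappedDelayLine:
    """Keeps the newest `taps` samples and returns them stacked, newest first."""

    def __init__(self, taps, width):
        # Assumption: start from zeros so the stacked vector always has
        # taps * width entries, even before `taps` samples were pushed.
        self._buffer = deque([[0.0] * width for _ in range(taps)], maxlen=taps)

    def push(self, sample):
        # With maxlen set, appendleft discards the oldest sample on the right.
        self._buffer.appendleft(list(sample))

    def vector(self):
        return [value for sample in self._buffer for value in sample]


tdl_x = TappedDelayLine(taps=3, width=2)   # TDL_x = 3, nx = 2
for sample in ([1.0, 2.0], [3.0, 4.0]):
    tdl_x.push(sample)
print(tdl_x.vector())
```

After the two pushes the vector holds x(t), x(t − 1), and the zero-filled oldest slot, in that order.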


Given the causal plant described in Equation 17.7 with nonlinear f(·) subject to abrupt faults
characterized by discontinuous changes in its parameters or structure, the primary goal of the
controller (17.8) is to make the states track the desired trajectory x^t(t). Since particular fault
scenarios may render regions of the state space unreachable to the plant, the controller is not required
to reduce the tracking error to zero, but rather to minimize it under the constraints of each particular
fault. In the controller, g(·) is a nonlinear continuously differentiable approximator composed of three
NNs: identification, action, and critic. The way each NN is trained online and how they interact in the
GDHP architecture are explained in detail in the following sections.


\[ x(t) = f\bigl(\bar{x}(t-1),\, \bar{u}(t-1)\bigr) \tag{17.7} \]

\[ u(t) = g\bigl(\bar{x}(t),\, \bar{u}(t-1),\, x^{t}(t)\bigr) \tag{17.8} \]


17.7.3 Identification Neural Network 


The IdNN (shown in Figure 17.6) is responsible for generating a differentiable map that matches the
dynamics of the plant. Note that, in the used notation, all variables related specifically to the IdNN
receive the superscript i. Designed as a two-layered recurrent NN [18,19] with input p^i(t) in Equation
17.9, nhi neurons in the hidden layer and a tangent sigmoid transfer function in Equation 17.10, the
IdNN outputs a vector of the estimated states x^i(t) (Equation 17.11).


FIGURE 17.6 IdNN RNN architecture.


\[ p^{i}(t) = \begin{bmatrix} \bar{x}(t-1) \\ \bar{u}(t-1) \\ a^{i}(t-1) \\ 1 \end{bmatrix} \tag{17.9} \]

\[ a^{i}(t) = \operatorname{tansig}\bigl(W^{i1}(t)\, p^{i}(t)\bigr) \tag{17.10} \]

\[ x^{i}(t) = W^{i2}(t) \begin{bmatrix} a^{i}(t) \\ 1 \end{bmatrix} \tag{17.11} \]
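
The forward pass of such a two-layered recurrent network can be sketched in plain Python as follows. The sketch assumes the input ordering described above (delayed states, delayed inputs, previous hidden activations, and a constant 1 acting as bias) and uses tanh, which is mathematically identical to the tangent sigmoid. The helper names matvec and idnn_forward are illustrative.

```python
import math


def matvec(matrix, vector):
    """Matrix (list of rows) times column vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def idnn_forward(w1, w2, x_bar_prev, u_bar_prev, a_prev):
    """One forward step of a two-layer recurrent identifier.

    w1 has nhi rows and len(x_bar_prev) + len(u_bar_prev) + nhi + 1 columns;
    w2 has nx rows and nhi + 1 columns. The trailing 1.0 is the bias input.
    """
    p = list(x_bar_prev) + list(u_bar_prev) + list(a_prev) + [1.0]  # network input
    a = [math.tanh(z) for z in matvec(w1, p)]                       # hidden layer
    x_est = matvec(w2, a + [1.0])                                   # estimated states
    return x_est, a


# nx = 1, nu = 1, one tap each, nhi = 2 hidden neurons.
w1 = [[1.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0]]
w2 = [[2.0, -0.5, 0.25]]
x_est, a = idnn_forward(w1, w2, [0.3], [0.1], [0.0, 0.0])
print(x_est, a)
```

The returned activations are fed back as a_prev on the next call, which is what makes the network recurrent.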


The network is trained online with the goal of minimizing the identification error E^i(t) subject to the
relative importance matrix S in Equation 17.12. Generally, the matrix S is set as the identity; however,
by adjusting the magnitude of the diagonal elements, the IdNN can be made to focus more on the reduction
of the identification error of certain states. By applying the steepest descent training algorithm, the
weight update (valid for both layers) is given by Equation 17.13.


\[ E^{i}(t) = \tfrac{1}{2} \bigl(x^{i}(t) - x(t)\bigr)^{T} S \bigl(x^{i}(t) - x(t)\bigr) \tag{17.12} \]

\[ w^{i}(t+1) = w^{i}(t) - \beta^{i} \left( \frac{dx^{i}(t)}{dw^{i}} \right)^{T} S \bigl(x^{i}(t) - x(t)\bigr) \tag{17.13} \]


where
w^i is a column vector of the elements of the corresponding weight matrix W^i
β^i is the learning rate

Equations 17.14 through 17.16 show how the required derivatives are calculated. 


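\[ \frac{da^{i}(t)}{dw^{i1}} = \bigl(I - \operatorname{diag}(a^{i}(t))^{2}\bigr) \left( \begin{bmatrix} p^{i}(t)^{T} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & p^{i}(t)^{T} \end{bmatrix} + W^{i1}_{(:,\,\mathrm{end}-nhi:\mathrm{end}-1)}(t)\, \frac{da^{i}(t-1)}{dw^{i1}} \right) \tag{17.14} \]

\[ \frac{dx^{i}(t)}{dw^{i1}} = W^{i2}_{(:,\,1:nhi)}(t)\, \frac{da^{i}(t)}{dw^{i1}} \tag{17.15} \]

\[ \frac{dx^{i}(t)}{dw^{i2}} = \begin{bmatrix} [\,a^{i}(t)^{T} \;\; 1\,] & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & [\,a^{i}(t)^{T} \;\; 1\,] \end{bmatrix} \tag{17.16} \]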


where 
w^{i1} corresponds to the weights of the first layer
w^{i2} to those of the second
The standard MATLAB® notation for the indication of rows and columns within a matrix is used. In 
such notation, when used by itself, the colon indicates all entities in a particular dimension (e.g., 
all rows or all columns), while when used between two numbers or variables it indicates the range 
between and containing such values in the corresponding dimension (e.g., all rows from 5 to nhi) 
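
For readers more used to zero-based languages, the following Python sketch shows how the colon notation described above maps onto slices of a list-of-lists matrix. The helper name block is illustrative. Note that MATLAB indices are 1-based and inclusive, while Python slices are 0-based and exclude the upper bound.

```python
def block(matrix, rows=slice(None), cols=slice(None)):
    """Return the sub-matrix matrix[rows, cols] of a list-of-lists matrix."""
    return [row[cols] for row in matrix[rows]]


w = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34]]

# MATLAB W(:, 1:2): all rows, columns 1 to 2.
print(block(w, cols=slice(0, 2)))
# MATLAB W(2:3, :): rows 2 to 3, all columns.
print(block(w, rows=slice(1, 3)))
# MATLAB W(:, 1:end-1): every column except the last.
print(block(w, cols=slice(0, -1)))
```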


In order to train both the AcNN and the CrNN, information on the plant dynamics is required. Once
the IdNN has converged to an estimator of the plant, the derivative of the output with respect to the
input calculated by Equations 17.17 through 17.19 can be used as an approximation to part of the plant
dynamics. Equations 17.20 and 17.21 show how the previous derivative is used to build the complete
dynamic description when TDL_x is greater than 1.


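\[ \frac{da^{i}(t)}{d\bar{u}(t)} = \begin{bmatrix} 0_{(nhi,\,nu)} & \left. \dfrac{da^{i}(t)}{d\bar{u}(t-1)} \right|_{(:,\,1:\mathrm{end}-nu)} \end{bmatrix} \tag{17.17} \]

\[ \frac{da^{i}(t+1)}{d\bar{u}(t)} = \bigl(I - \operatorname{diag}(a^{i}(t+1))^{2}\bigr)\, W^{i1}_{(:,\,1:\mathrm{end}-1)}(t+1) \begin{bmatrix} \dfrac{d\bar{x}(t)}{d\bar{u}(t)} \\[2mm] \dfrac{d\bar{u}(t)}{d\bar{u}(t)} \\[2mm] \dfrac{da^{i}(t)}{d\bar{u}(t)} \end{bmatrix} \tag{17.18} \]

\[ \frac{dx(t+1)}{d\bar{u}(t)} \approx \frac{dx^{i}(t+1)}{d\bar{u}(t)} = W^{i2}_{(:,\,1:nhi)}(t+1)\, \frac{da^{i}(t+1)}{d\bar{u}(t)} \tag{17.19} \]

\[ \frac{d\bar{x}(t)}{d\bar{u}(t)} = \begin{bmatrix} 0_{(nx \cdot TDL_{x},\,nu)} & \left. \dfrac{d\bar{x}(t)}{d\bar{u}(t-1)} \right|_{(:,\,1:\mathrm{end}-nu)} \end{bmatrix} \tag{17.20} \]

\[ \frac{d\bar{x}(t+1)}{d\bar{u}(t)} = \begin{bmatrix} \dfrac{dx(t+1)}{d\bar{u}(t)} \\[2mm] \left. \dfrac{d\bar{x}(t)}{d\bar{u}(t)} \right|_{(1:\mathrm{end}-nx,\,:)} \end{bmatrix} \tag{17.21} \]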


The information on the plant dynamics is completed with the knowledge of how the current and past
states affect the state on the next step. Therefore, the differential map of the IdNN must also be used
to calculate the derivative of x̄(t + 1) with respect to x̄(t). The process through which such derivative
is obtained, detailed in Equations 17.22 through 17.26, is analogous to the one performed in (17.17)
through (17.21). Note that in the process of calculating both derivatives, the causality of the plant is
taken into consideration. However, while causality restricts dx̄(t + 1)/dū(t) to a block upper triangular
matrix, dx̄(t + 1)/dx̄(t) is an upper triangular matrix with ones in the diagonal.


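\[ \frac{da^{i}(t)}{d\bar{x}(t)} = \begin{bmatrix} 0_{(nhi,\,nx)} & \left. \dfrac{da^{i}(t)}{d\bar{x}(t-1)} \right|_{(:,\,1:\mathrm{end}-nx)} \end{bmatrix} \tag{17.22} \]

\[ \frac{da^{i}(t+1)}{d\bar{x}(t)} = \bigl(I - \operatorname{diag}(a^{i}(t+1))^{2}\bigr)\, W^{i1}_{(:,\,1:\mathrm{end}-1)}(t+1) \begin{bmatrix} \dfrac{d\bar{x}(t)}{d\bar{x}(t)} \\[2mm] \dfrac{d\bar{u}(t)}{d\bar{x}(t)} \\[2mm] \dfrac{da^{i}(t)}{d\bar{x}(t)} \end{bmatrix} \tag{17.23} \]

\[ \frac{dx(t+1)}{d\bar{x}(t)} \approx \frac{dx^{i}(t+1)}{d\bar{x}(t)} = W^{i2}_{(:,\,1:nhi)}(t+1)\, \frac{da^{i}(t+1)}{d\bar{x}(t)} \tag{17.24} \]

\[ \frac{d\bar{x}(t)}{d\bar{x}(t)} = \begin{bmatrix} \begin{matrix} I_{nx} \\ 0_{(nx \cdot (TDL_{x}-1),\,nx)} \end{matrix} & \left. \dfrac{d\bar{x}(t)}{d\bar{x}(t-1)} \right|_{(:,\,1:\mathrm{end}-nx)} \end{bmatrix} \tag{17.25} \]

\[ \frac{d\bar{x}(t+1)}{d\bar{x}(t)} = \begin{bmatrix} \dfrac{dx(t+1)}{d\bar{x}(t)} \\[2mm] \left. \dfrac{d\bar{x}(t)}{d\bar{x}(t)} \right|_{(1:\mathrm{end}-nx,\,:)} \end{bmatrix} \tag{17.26} \]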


17.7.4 Action Neural Network 


The core of the GDHP adaptive controller, the AcNN is responsible for the generation of the control
input u(t). Similar to the IdNN, the AcNN is also built on a two-layered architecture, as can be seen in
Figure 17.7 and in the network description in Equations 17.27 through 17.29. Equivalently, the
superscript a is used over all variables specifically related to the AcNN.


FIGURE 17.7 AcNN RNN architecture.


\[ p^{a}(t) = \begin{bmatrix} \bar{x}(t) \\ \bar{u}(t-1) \\ a^{a}(t-1) \\ x^{t}(t) \\ 1 \end{bmatrix} \tag{17.27} \]

\[ a^{a}(t) = \operatorname{tansig}\bigl(W^{a1}(t)\, p^{a}(t)\bigr) \tag{17.28} \]

\[ u(t) = W^{a2}(t) \begin{bmatrix} a^{a}(t) \\ 1 \end{bmatrix} \tag{17.29} \]
The training of the AcNN has the goal of producing the control sequence u(t) that minimizes the cost
function J(t), defined in Equation 17.2 as the sum of all future values of the utility function U(t)
(Equation 17.1) with a decaying factor γ (0 < γ < 1). The diagonal matrices Q and R have the same
purpose as S in the IdNN, while ρ adjusts the degree to which the amount of energy spent in the control
effort is penalized relative to the tracking error.

As in the IdNN, a steepest descent training algorithm was applied, resulting in the update Equation
17.30. For reasons that will become clear in the description of the critic, the differentiation of J(t)
with respect to the weights of the AcNN is not performed directly from the infinite sum (17.2). The
relationship (17.31) is used instead, resulting in Equation 17.32.


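\[ w^{a}(t+1) = w^{a}(t) - \beta^{a} \left( \frac{dJ(t)}{dw^{a}} \right)^{T} \tag{17.30} \]

\[ J(t) = U(t) + \gamma\, J(t+1) \tag{17.31} \]

\[ \frac{dJ(t)}{dw^{a}} = \left( \frac{dU(t)}{d\bar{u}(t)} + \gamma\, \lambda^{x}(t+1)\, \frac{d\bar{x}(t+1)}{d\bar{u}(t)} + \gamma\, \lambda^{u}(t+1)\, \frac{d\bar{u}(t+1)}{d\bar{u}(t)} \right) \frac{d\bar{u}(t)}{dw^{a}} \tag{17.32} \]

where
β^a is the learning rate of the AcNN
λ^x(t) = ∂J(t)/∂x̄(t) and λ^u(t) = ∂J(t)/∂ū(t) are outputs of the CrNN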

The next step is the calculation of the derivative of the input with respect to the weights of the AcNN.
Equations 17.33 through 17.34 for the first layer and Equations 17.35 through 17.36 for the second layer
were derived in the same fashion as Equations 17.14 through 17.16 of the IdNN. Equation 17.37 describes
the way the full temporal derivative is obtained for both layers. It is important to note that, unlike
the IdNN, the AcNN is positioned in a closed loop with the plant. Therefore, in Equations 17.33 and
17.35, the AcNN derivation path extends to include information on the dynamics of the plant,
approximated by the IdNN.
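\[ \frac{da^{a}(t)}{dw^{a1}} = \bigl(I - \operatorname{diag}(a^{a}(t))^{2}\bigr) \left( \begin{bmatrix} p^{a}(t)^{T} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & p^{a}(t)^{T} \end{bmatrix} + W^{a1}_{(:,\,1:\mathrm{end}-nx-1)}(t) \begin{bmatrix} \dfrac{d\bar{x}(t)}{d\bar{u}(t-1)} \dfrac{d\bar{u}(t-1)}{dw^{a1}} \\[2mm] \dfrac{d\bar{u}(t-1)}{dw^{a1}} \\[2mm] \dfrac{da^{a}(t-1)}{dw^{a1}} \end{bmatrix} \right) \tag{17.33} \]

\[ \frac{du(t)}{dw^{a1}} = W^{a2}_{(:,\,1:nha)}(t)\, \frac{da^{a}(t)}{dw^{a1}} \tag{17.34} \]

\[ \frac{da^{a}(t)}{dw^{a2}} = \bigl(I - \operatorname{diag}(a^{a}(t))^{2}\bigr)\, W^{a1}_{(:,\,1:\mathrm{end}-nx-1)}(t) \begin{bmatrix} \dfrac{d\bar{x}(t)}{d\bar{u}(t-1)} \dfrac{d\bar{u}(t-1)}{dw^{a2}} \\[2mm] \dfrac{d\bar{u}(t-1)}{dw^{a2}} \\[2mm] \dfrac{da^{a}(t-1)}{dw^{a2}} \end{bmatrix} \tag{17.35} \]

\[ \frac{du(t)}{dw^{a2}} = \begin{bmatrix} [\,a^{a}(t)^{T} \;\; 1\,] & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & [\,a^{a}(t)^{T} \;\; 1\,] \end{bmatrix} + W^{a2}_{(:,\,1:nha)}(t)\, \frac{da^{a}(t)}{dw^{a2}} \tag{17.36} \]

\[ \frac{d\bar{u}(t)}{dw^{a}} = \begin{bmatrix} \dfrac{du(t)}{dw^{a}} \\[2mm] \left. \dfrac{d\bar{u}(t-1)}{dw^{a}} \right|_{(1:\mathrm{end}-nu,\,:)} \end{bmatrix} \tag{17.37} \]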


In Equations 17.18 and 17.32, the derivative of the tap delayed input with respect to itself was
required. Equations 17.38 through 17.42 display how those are calculated. Since W^a(t + 1) is not yet
available at this stage, the terms with superscript tilde are obtained using W^a(t) as an approximation.
Note that p^a(t + 1), used for the calculation of ã^a(t + 1), can be generated by using the IdNN to
estimate the future states of the plant, assuming x^t(t + 1) is available.


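\[ \frac{da^{a}(t)}{d\bar{u}(t)} = \begin{bmatrix} 0_{(nha,\,nu)} & \left. \dfrac{da^{a}(t)}{d\bar{u}(t-1)} \right|_{(:,\,1:\mathrm{end}-nu)} \end{bmatrix} \tag{17.38} \]

\[ \frac{d\tilde{a}^{a}(t+1)}{d\bar{u}(t)} = \bigl(I - \operatorname{diag}(\tilde{a}^{a}(t+1))^{2}\bigr)\, W^{a1}_{(:,\,1:\mathrm{end}-nx-1)}(t) \begin{bmatrix} \dfrac{d\bar{x}(t+1)}{d\bar{u}(t)} \\[2mm] \dfrac{d\bar{u}(t)}{d\bar{u}(t)} \\[2mm] \dfrac{da^{a}(t)}{d\bar{u}(t)} \end{bmatrix} \tag{17.39} \]

\[ \frac{d\tilde{u}(t+1)}{d\bar{u}(t)} = W^{a2}_{(:,\,1:nha)}(t)\, \frac{d\tilde{a}^{a}(t+1)}{d\bar{u}(t)} \tag{17.40} \]

\[ \frac{d\bar{u}(t)}{d\bar{u}(t)} = \begin{bmatrix} \begin{matrix} I_{nu} \\ 0_{(nu \cdot (TDL_{u}-1),\,nu)} \end{matrix} & \left. \dfrac{d\bar{u}(t)}{d\bar{u}(t-1)} \right|_{(:,\,1:\mathrm{end}-nu)} \end{bmatrix} \tag{17.41} \]

\[ \frac{d\bar{u}(t+1)}{d\bar{u}(t)} = \begin{bmatrix} \dfrac{d\tilde{u}(t+1)}{d\bar{u}(t)} \\[2mm] \left. \dfrac{d\bar{u}(t)}{d\bar{u}(t)} \right|_{(1:\mathrm{end}-nu,\,:)} \end{bmatrix} \tag{17.42} \]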


In later developments, the information on how the future input is affected by the present states of the 
plant is required. For such purpose, Equations 17.43 through 17.47 are provided. 


aa") , _da"(t)_ 
dx(t) = oo . dx(t = 1) a (17.43) 
dx(t) 
dx(t) 
a 5 4 3 2 1 
ae a 7 ( I -diag(a*(t)) When ee ayaa) 
dx(t) 
du(t) a da‘(t) 
dx(t) Westent-v(t) dx(t) (17.45) 
— = oe : > | (17.46) 
(:,l:end—nx) 
du(t) 
= dx(t) 
“0 SH seeden secs ap 
“ du(t —1) 


d X(t) iend—nu.2) 


17.7.5 Critic Neural Network 


The third and final NN, the critic is responsible for the estimation of the cost function J(t) and of its
derivatives with respect to the inputs and states (λ^u(t) and λ^x(t), respectively). Consistent with the
notation of the other two NNs, all variables specifically related to the CrNN are marked by a
superscript c. As shown in the network description in Figure 17.8 and Equations 17.48 through 17.50, the
aforementioned derivatives are obtained directly as outputs of the network, instead of through
backpropagation from the cost function.
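FIGURE 17.8 CrNN RNN architecture.


\[ p^{c}(t) = \begin{bmatrix} \bar{x}(t) \\ \bar{u}(t) \\ a^{c}(t-1) \\ 1 \end{bmatrix} \tag{17.48} \]

\[ a^{c}(t) = \operatorname{tansig}\bigl(W^{c1}(t)\, p^{c}(t)\bigr) \tag{17.49} \]

\[ \begin{bmatrix} \lambda^{x}(t)^{T} \\ \lambda^{u}(t)^{T} \\ J(t) \end{bmatrix} = W^{c2}(t) \begin{bmatrix} a^{c}(t) \\ 1 \end{bmatrix} \tag{17.50} \]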


The GDHP critic's weight update Equation 17.51 is a combination of the training algorithms of HDP
(minimizing the estimation error of J(t)) and DHP (minimizing the estimation error of λ(t)). Although
the influence of the HDP and DHP algorithms can be decoupled in the update of the weights of the
second layer, both terms equally affect all the weights of the first layer. This superposition of training
approaches in the first layer of the CrNN is the main source of the synergy of GDHP [20].


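\[ w^{c}(t+1) = w^{c}(t) - \beta^{c} (1-\eta) \left( \frac{dJ(t)}{dw^{c}} \right)^{T} \bigl(J(t) - J^{\circ}(t)\bigr) - \beta^{c} \eta \begin{bmatrix} \dfrac{d\lambda^{x}(t)^{T}}{dw^{c}} \\[2mm] \dfrac{d\lambda^{u}(t)^{T}}{dw^{c}} \end{bmatrix}^{T} \begin{bmatrix} \lambda^{x}(t)^{T} - \lambda^{x\circ}(t)^{T} \\ \lambda^{u}(t)^{T} - \lambda^{u\circ}(t)^{T} \end{bmatrix} \tag{17.51} \]


where
β^c is the learning rate of the CrNN
η ∈ [0, 1] is a parameter that adjusts how HDP and DHP are combined in GDHP


For η = 0, the training of the CrNN reduces to a pure HDP, while η = 1 does the same for DHP.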
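
One way to read this combination is as steepest descent on a blended squared error, in which the HDP error on J(t) is weighted by 1 − η and the DHP error on the derivatives by η. The Python sketch below evaluates such a blended error. It is an interpretation under that assumption rather than the authors' implementation, and the function name is illustrative.

```python
def gdhp_critic_error(j_est, j_target, lam_est, lam_target, eta):
    """Blend of the HDP and DHP squared errors, weighted by eta in [0, 1]."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    hdp_error = 0.5 * (j_est - j_target) ** 2
    dhp_error = 0.5 * sum((l - lt) ** 2 for l, lt in zip(lam_est, lam_target))
    return (1.0 - eta) * hdp_error + eta * dhp_error


# eta = 0 keeps only the HDP term, eta = 1 only the DHP term.
print(gdhp_critic_error(3.0, 1.0, [1.0, 0.0], [0.0, 0.0], eta=0.0))
print(gdhp_critic_error(3.0, 1.0, [1.0, 0.0], [0.0, 0.0], eta=1.0))
```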

Since the cost function J(t) is a weighted sum of present and future variables, the targets, J°(t),
λ^{x°}(t), and λ^{u°}(t), are not analytically available when performing online learning. In order to
generate values that will in time converge to the true targets, relationship (17.31) is used, resulting
in Equations 17.52 through 17.54.


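\[ J^{\circ}(t) = U(t) + \gamma\, J(t+1) \tag{17.52} \]

\[ \lambda^{x\circ}(t) = \frac{\partial J(t)}{\partial \bar{x}(t)} = \frac{\partial U(t)}{\partial \bar{x}(t)} + \gamma \left( \lambda^{x}(t+1)\, \frac{d\bar{x}(t+1)}{d\bar{x}(t)} + \lambda^{u}(t+1)\, \frac{d\bar{u}(t+1)}{d\bar{x}(t)} \right) \tag{17.53} \]

\[ \lambda^{u\circ}(t) = \frac{\partial J(t)}{\partial \bar{u}(t)} = \frac{\partial U(t)}{\partial \bar{u}(t)} + \gamma \left( \lambda^{x}(t+1)\, \frac{d\bar{x}(t+1)}{d\bar{u}(t)} + \lambda^{u}(t+1)\, \frac{d\bar{u}(t+1)}{d\bar{u}(t)} \right) \tag{17.54} \]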
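
The bootstrapped targets can be sketched in Python as follows. The sketch assumes that the derivative targets follow from differentiating the one-step relationship between J(t) and J(t + 1) with the chain rule. Derivatives are treated as row vectors, Jacobians are stored as lists of rows, and all function and argument names are illustrative.

```python
def vecmat(row, matrix):
    """Row vector times matrix (matrix given as a list of rows)."""
    return [sum(r * m for r, m in zip(row, col)) for col in zip(*matrix)]


def j_target(utility_now, j_next, gamma):
    """Target for the cost estimate: U(t) + gamma * J(t + 1)."""
    return utility_now + gamma * j_next


def lambda_target(d_utility, lam_x_next, lam_u_next, dx_next, du_next, gamma):
    """Target for the derivative of J(t) with respect to z (states or inputs).

    d_utility : partial derivative of U(t) with respect to z (row vector)
    dx_next   : Jacobian of the next states with respect to z
    du_next   : Jacobian of the next inputs with respect to z
    """
    through_x = vecmat(lam_x_next, dx_next)
    through_u = vecmat(lam_u_next, du_next)
    return [d + gamma * (a + b) for d, a, b in zip(d_utility, through_x, through_u)]


print(j_target(1.0, 4.0, 0.5))
print(lambda_target([2.0, 0.0], [1.0, 1.0], [2.0],
                    [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0]], 0.5))
```

In the GDHP setting, the values at t + 1 would come from the critic itself and the Jacobians from the identifier and action networks.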


The next step is the calculation of the partial derivatives of the critic's outputs with respect to its weights. 
Equations 17.55 through 17.57 demonstrate how those are obtained. 


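\[ \frac{da^{c}(t)}{dw^{c1}} = \bigl(I - \operatorname{diag}(a^{c}(t))^{2}\bigr) \left( \begin{bmatrix} p^{c}(t)^{T} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & p^{c}(t)^{T} \end{bmatrix} + W^{c1}_{(:,\,\mathrm{end}-nhc:\mathrm{end}-1)}(t)\, \frac{da^{c}(t-1)}{dw^{c1}} \right) \tag{17.55} \]

\[ \begin{bmatrix} \dfrac{d\lambda^{x}(t)^{T}}{dw^{c1}} \\[2mm] \dfrac{d\lambda^{u}(t)^{T}}{dw^{c1}} \\[2mm] \dfrac{dJ(t)}{dw^{c1}} \end{bmatrix} = W^{c2}_{(:,\,1:\mathrm{end}-1)}(t)\, \frac{da^{c}(t)}{dw^{c1}} \tag{17.56} \]

\[ \begin{bmatrix} \dfrac{d\lambda^{x}(t)^{T}}{dw^{c2}} \\[2mm] \dfrac{d\lambda^{u}(t)^{T}}{dw^{c2}} \\[2mm] \dfrac{dJ(t)}{dw^{c2}} \end{bmatrix} = \begin{bmatrix} [\,a^{c}(t)^{T} \;\; 1\,] & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & [\,a^{c}(t)^{T} \;\; 1\,] \end{bmatrix} \tag{17.57} \]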
dw? 


Completing the requirements of Equations 17.53 through 17.54, the partial derivatives of the utility 
function with respect to the states and inputs are provided in Equations 17.58 through 17.59. Equation 
17.60 shows how the full derivative of the utility function with respect to the inputs is calculated, as 
required in (17.32). 


oU(t) sony 
(8) =(x(t)-x'() S (17.58) 
OUD) ns dar 


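\[ \frac{dU(t)}{d\bar{u}(t)} = \frac{\partial U(t)}{\partial \bar{x}(t)}\, \frac{d\bar{x}(t)}{d\bar{u}(t)} + \frac{\partial U(t)}{\partial \bar{u}(t)}\, \frac{d\bar{u}(t)}{d\bar{u}(t)} \tag{17.60} \]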


17.7.6 Complete GDHP Algorithm 


A key issue in the implementation of all adaptive critic designs is how to coordinate the online training
of the three NNs. While the IdNN is trained independently, since it uses information of the plant alone,
the training of the AcNN and that of the CrNN each depend on the weights of the other. If no provisions
are made, both networks are forced to follow a moving target, making the whole process potentially
slower and likely unstable. In Ref. [20], four different strategies were discussed and compared through
the application on two different test beds, demonstrating the superior performance, stability, and
reduced training time of a particular one that we choose to implement. Although the original work was
developed for the DHP architecture, the extension to GDHP is straightforward. The strategy of interest
differs from the others in that it utilizes two distinct NNs to implement the critic. The first (CrNN#1)
outputs J(t) and λ(t) and is trained at every iteration, whereas the second (CrNN#2) outputs J(t + 1)
and λ(t + 1) and is updated with a copy of the first only once in a given period of iterations (i.e.,
epoch). With this training approach, it is possible to train both the AcNN and the CrNN continuously,
allowing the adaptive critic controller to start responding to a fault as soon as it occurs.

With all the mathematical content of GDHP already available in Equations 17.1 through 17.60, a
pseudocode version of the actual algorithm is presented in a condensed format in Table 17.1.


TABLE 17.1 Pseudocode for the Presented GDHP Controller

1. Set t = 1, e = 1. Initialize NN weights and network derivatives. Estimate x^i(1).
2. Sample the plant states x(t) and desired trajectory x^t(t).
3. Update the weights of the IdNN by generating w^i(t + 1) (Equations 17.13 through 17.16).
4. Feedforward through all 3 NNs (AcNN and CrNN twice) to generate, in this order: u(t), x^i(t + 1),
   ũ(t + 1), λ^x(t), λ^u(t), J(t), λ^x(t + 1), λ^u(t + 1), and J(t + 1) (Equations 17.9 through 17.11,
   17.31 through 17.32, and 17.48 through 17.50).
5. Calculate U(t) (Equation 17.1).
6. Backpropagate to generate dx̄(t + 1)/dū(t), dx̄(t + 1)/dx̄(t), dū(t + 1)/dū(t), and dū(t)/dx̄(t)
   (Equations 17.17 through 17.26 and 17.38 through 17.47).
7. Calculate ∂U(t)/∂x̄(t) and ∂U(t)/∂ū(t) (Equations 17.58 through 17.59).
8. Update the weights of the AcNN by generating w^a(t + 1) (Equations 17.30 through 17.37).
9. Update the weights of the CrNN by generating w^c(t + 1) (Equations 17.51 through 17.57).
10. If e = epoch, copy the weights of CrNN#1 to CrNN#2 and set e = 1.
11. t = t + 1, e = e + 1. Return to 2.
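
The coordination between the two critics can be illustrated with a small Python sketch. It is a simplified stand-in and not the authors' code: the weights are plain lists, the training step is passed in as a function, and a modulo counter replaces the e counter of the pseudocode. The class name DualCritic is illustrative.

```python
import copy


class DualCritic:
    """Holds a continuously trained critic (#1) and a periodically refreshed copy (#2)."""

    def __init__(self, initial_weights, epoch):
        self.online = copy.deepcopy(initial_weights)   # CrNN#1, trained at every step
        self.frozen = copy.deepcopy(initial_weights)   # CrNN#2, supplies the t + 1 outputs
        self.epoch = epoch
        self.steps = 0

    def step(self, update):
        """Apply `update` to CrNN#1, then refresh CrNN#2 once every `epoch` steps."""
        self.online = update(self.online)
        self.steps += 1
        if self.steps % self.epoch == 0:
            self.frozen = copy.deepcopy(self.online)


critic = DualCritic(initial_weights=[0.0], epoch=3)
history = []
for _ in range(7):
    critic.step(lambda w: [w[0] + 1.0])   # stand-in for the critic weight update
    history.append(critic.frozen[0])
print(history)
```

Between refreshes the second critic provides a fixed target, so the action network and the first critic are not chasing each other at every step.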


17.8 Fault Tolerant Control 


Failure prevention is not a new concept in the theory or practice of engineering. The components or
machinery that form a system are often built with safety protections such as fuses or limit switches.
Continued operation or start-up is prevented if sensors like those indicate that the conditions to
enter a local shutdown mode are met. This local safety approach, though, does not guarantee global
fail-safe operation for the complete system. Ship propulsion systems provide an example where the
application of local safety without analysis of the global implications has resulted in many events
whose consequences vary from irregularity to major economic loss and casualties [3].

Another failure prevention approach derives from the use of direct hardware redundancy. If three
or more independent sensors are used to directly measure the same variable, majority voting can be
used not only to detect a fault, but also to isolate the faulty sensor. When only two redundant sensors
are available, isolation is not necessarily achievable, but fault detection is still guaranteed. The
remedial action to be taken is then simply to ignore the isolated sensor or to generate an alarm when no
trustworthy signal is available.
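
A minimal Python sketch of such a voting scheme is given below. It assumes that two readings agree when they differ by no more than a user-chosen tolerance. The function name vote and the tolerance parameter are illustrative and do not come from the text.

```python
def vote(readings, tolerance):
    """Majority voting over redundant sensor readings.

    Returns (value, faulty). value is the mean of the largest group of readings
    lying within `tolerance` of one reference reading, and faulty lists the
    indices outside that group. value is None when no strict majority exists.
    """
    best = []
    for ref in readings:
        group = [j for j, r in enumerate(readings) if abs(r - ref) <= tolerance]
        if len(group) > len(best):
            best = group
    if 2 * len(best) <= len(readings):
        # Fault detected, but no sensor can be singled out.
        return None, list(range(len(readings)))
    value = sum(readings[j] for j in best) / len(best)
    faulty = [j for j in range(len(readings)) if j not in best]
    return value, faulty


print(vote([10.0, 10.5, 17.0], tolerance=1.0))   # third sensor is isolated
print(vote([10.0, 17.0], tolerance=1.0))         # detection without isolation
```

With three sensors the outlier is isolated and ignored. With two disagreeing sensors the function can only report that no trustworthy value exists.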

The same principle is applied to components and actuators, though it is possible in those cases that
more than one output from different elements, each operating at only a fraction of its total capability,
is used at the same time. After a fault is isolated in one of the elements, the failure prevention
approach then becomes one of energy redistribution among the healthy set.
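
As a simple illustration of redistribution among the healthy set, the Python sketch below shares a total demand among the healthy actuators in proportion to their capacities. The proportional rule is an assumption chosen for brevity, since the text does not prescribe a particular allocation, and the function name is illustrative.

```python
def redistribute(demand, capacities, healthy):
    """Share `demand` among healthy actuators in proportion to their capacity.

    Faulty actuators receive 0.0. Raises ValueError when the healthy set
    cannot supply the demand.
    """
    available = sum(c for c, ok in zip(capacities, healthy) if ok)
    if available <= 0 or demand > available:
        raise ValueError("healthy actuators cannot supply the demand")
    return [demand * c / available if ok else 0.0
            for c, ok in zip(capacities, healthy)]


capacities = [10.0, 10.0, 10.0]
print(redistribute(12.0, capacities, [True, True, True]))    # nominal operation
print(redistribute(12.0, capacities, [True, False, True]))   # second actuator isolated
```

In nominal operation each actuator runs at a fraction of its capability. After the fault, the two remaining actuators take up the load of the isolated one.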
FTC's goal is to prevent failures at system level through proper actions in the programmable parts of
a control loop. In this approach, analytical redundancy can be used in place of its hardware
counterpart. Analytical redundancy not only helps to reduce the cost involved in using extra elements,
but also delivers greater design freedom to avoid the loss of performance that may result from direct
hardware redundancy implementation. When sensors are considered, the use of analytical relations united
with the actual measurements also increases the degree of confidence in the considered variable. Since
FTC focuses on the overall mission goal and aims for continuous system availability, unlike the other
failure prevention approaches mentioned earlier, a loss of performance is allowed after a fault occurs.
As a matter of fact, given the specific redundancies available in a given system, a reconfiguration to a
state of inferior performance might be an optimal solution when the mission objective, such as stability,
is preferred.


17.8.1 Passive versus Active Approaches 


One possible way to implement fault tolerance is to design static control laws capable of compensating for
some plant uncertainties such as disturbances and noise [5]. If the effects of a fault are small enough to 
be in the range covered by the robustness of the controller, no specific reconfiguration is required. Since 
no information about the faults is typically utilized by the control system, this type of approach is often 
referred to as “passive FTC.” 

By utilizing fault information extracted from the system, it becomes possible to design a reconfigurable 
controller that modifies the control function (parameters or structure) in response to faults, characterizing 
an “active FTC.” This approach is preferable over the passive one when tolerance to a wider range of faults 
is intended since the required increase in robustness has a negative effect on the performance, even under 
nominal operation. As depicted in the generic active FTC diagram in Figure 17.9, it is common to separate 
the control algorithm into two distinct blocks: a baseline controller and a supervisor system. While the 
baseline controller focuses on the maintenance of the immediate control objectives, the supervisor extracts 
fault information, determines remedial actions, and executes them by modifying the baseline controller.


17.8.2 Active FTC Methods 


Active FTC systems compensate for the effects of a fault either by selecting a new precomputed control 
law (projection-based methods) or by synthesizing a new control law online (online automatic controller 
redesign methods) [21]. 
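
The projection-based option amounts to a lookup among control laws prepared offline. The Python sketch below illustrates the idea with a dictionary of feedback gains. The fault labels and gain values are hypothetical and serve only as an example.

```python
# Hypothetical gains, one per anticipated scenario, computed offline.
PRECOMPUTED_GAINS = {
    "nominal": [2.0, 0.5],
    "actuator_loss": [3.5, 0.75],
    "sensor_bias": [1.5, 0.25],
}


def select_control_law(isolated_fault):
    """Projection-based reconfiguration: look up the law for the isolated fault."""
    if isolated_fault not in PRECOMPUTED_GAINS:
        raise KeyError(f"no precomputed control law for fault {isolated_fault!r}")
    return PRECOMPUTED_GAINS[isolated_fault]


def control(gains, error):
    """Feedback law u = sum of k_i * e_i."""
    return sum(k * e for k, e in zip(gains, error))


print(control(select_control_law("actuator_loss"), [1.0, 2.0]))
```

The KeyError branch marks the limitation of this family of methods: a fault that was not anticipated at design time has no entry to select.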

Gain scheduling (GS) [22], fuzzy decision logic [23], and structural analysis [3] are some of the
possible ways to implement projection-based active FTC. Models and precomputed controllers for
the system under nominal conditions and under the effect of the faults of interest are used during the
design phase to grant the controller quick and correct responses to the envisioned scenarios. However,
since fault information at least to the level of isolation is essential, it is necessary for the models
of the faulty scenarios to be accurate enough to be distinguishable under the effect of noise and
disturbances. Even when precision is not taken into account, the mere task of designing characteristic
models offline for all possible fault scenarios is by itself a challenging one, especially if complex
nonlinear plants are considered.


FIGURE 17.9 Generic active fault tolerant architecture depicting the baseline controller and the
supervisory system. In the figure, D represents a delay block, u(t) is the controlled input, and R(t) is
the output of the plant.

Online automatic redesign methods are of particular interest in light of the goal of the proposed
work due to their capability of providing specific control actions even for fault scenarios that had not
necessarily been anticipated during the design phase. Reconfigurable control can be used to implement
online redesign requiring only the residuals generated by fault detection. Nevertheless, the
flexibility gained by this approach comes at the expense of slower response, since the controller must
be allowed time to learn the new dynamics and modify itself. Since the reconfigurable controller does
not require knowledge of the dynamics of the system under the effect of each specific fault, it is
inherently immune to modeling errors and possesses a greater potential to deal with unmeasured
disturbances and noise-corrupted data.

A reconfiguration approach in which the eigenstructure can be directly assigned to the closed-loop
system to achieve the desired system stability and dynamic performance is known as eigenstructure 
assignment (EA) [24]. The conditions for exact assignment are the existence of a sufficient number of 
actuators and measurements available and that the desired eigenvectors reside in the achievable sub- 
spaces. The limitations of EA are that the system performance may not be optimal in any sense, and that 
the system requirements are often not easily specified in terms of the eigenstructure [25]. 

The pseudo-inverse method (PIM), on the other hand, is a reconfiguration method that is optimal 
in the sense that it minimizes the Frobenius norm of the difference matrix between the original and 
the impaired closed-loop system transition matrices. Since in its initial formulation stability cannot be 
guaranteed, a modified pseudo-inverse method (MPIM) was proposed [26]. In its initial formulation, 
however, MPIM required full state feedback and relied on stability bounds that could give very conser- 
vative results. Those limitations were the focus of Ref. [27], where the problem was reevaluated from an 
optimization point of view while focusing on FTC application. Although the state feedback constraint 
was relaxed to output feedback, the method still requires residuals for each parameter of the model to
be generated (comparison between transition matrices), limiting the reconfigurable fault scenarios to
those with the same dynamic structure as the nominal mode.
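
As a worked form of the PIM criterion, assume a linear plant with state matrix A and input matrix B under nominal state feedback u = Kx, and an impaired plant described by A_f and B_f. These symbols are introduced here for illustration only and do not appear in the text. Under those assumptions, the reconfigured gain that minimizes the Frobenius norm of the difference between the nominal and impaired closed-loop matrices is the least-squares solution

\[ K_{f} = \arg\min_{K_{f}} \bigl\| (A + BK) - (A_{f} + B_{f} K_{f}) \bigr\|_{F} = B_{f}^{+} \bigl( A + BK - A_{f} \bigr) \]

where B_f^+ denotes the Moore-Penrose pseudo-inverse of B_f. Nothing in this expression constrains the eigenvalues of A_f + B_f K_f, which is consistent with the remark that stability cannot be guaranteed in the initial formulation.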

However, both EA and PIM-based controllers are restricted to implementation on linear models. 
When a fixed dynamical nonlinear structure is available and only the parameters are unknown, adap- 
tive control can be used. Even to this restricted case, the assumptions that have to be made concerning 
the unknown plant to develop a stable adaptive controller were established only in the 1980s [28]. The 
problem becomes truly formidable when the plant is nonlinear and the input-output characteristics are 
unknown and time varying. 

From a system theoretic point of view, artificial NNs can be considered as practically implementable
parametrizations of nonlinear maps from one finite dimension space to another. Theoretical works by
several researchers have proven that, even with one hidden layer, NNs can uniformly approximate to
any degree of precision any piecewise continuous function over a compact domain, provided the
network has a sufficient number of units, or neurons. Therefore, NNs can, by their very nature, cope
with complexity, uncertainty, and nonlinearity, and they have been used successfully to identify and
control nonlinear dynamic systems [29].

Multilayer neural networks (MNN) and radial basis functions networks (RBFN) have proven 
extremely successful in pattern recognition problems, while recurrent neural networks (RNN) have 
been used in associative memories as well as optimization problems [30]. From the theoretic point 
of view, MNN and RBFN represent static nonlinear maps while RNN are represented by nonlinear
dynamic feedback systems [13]. 

In Ref. [31], a recurrent high-order neural network (RHONN) was developed with the goal of
identification of dynamical systems, displaying convergence properties similar to those of classical
adaptive and robust adaptive schemes. A Lyapunov-based approach is used to prove the convergence
property of the learning algorithm, which ensures that the identification error converges to zero
exponentially and that, if it is initially zero, it remains at zero during the whole identification
process. Later, in Ref. [13], the identification capabilities of the RHONN were used to provide state
information to a sliding mode controller to solve a tracking problem. However, the RHONN displays
serious restrictions to its applicability to complex systems due to a lack of scalability in its heavily
connected architecture.

In Ref. [32], a simplified RNN is used to identify the system, and its parameters are used as inputs to a
controller based on feedback linearization and pole placement. Stability, though, is assured only if the
controlled system remains stable, a limitation that greatly decreases the applicability of the method to
the FTC problem.

An RNN-based adaptive controller specially developed to deal with nonlinear systems with unknown
dynamics is presented in Ref. [33]. In the proposed configuration, the output of the RNN adaptive
controller is summed with the output of a linearizing controller, designed offline to deal with the
nonlinearities in the nominal model, before being applied to the system. The proposed learning algorithm
is stable in the Lyapunov sense, but the restrictions imposed to achieve such a proof make this approach
capable of dealing only with incipient faults.

In order to achieve semi-global boundedness of all signals in a control loop of a MIMO system, a
backstepping approach is used in Ref. [34] to divide the MIMO nonlinear model into a series of SISO
nonlinear models and design controllers separately using RBFNs. However, in order to achieve such a
degree of decoupling, it must be possible to describe the system in block-triangular form. Even if this is
true for the nominal model, a fault may strengthen relationships between states that could previously be
ignored, making it impossible for the system to fit in a block-triangular form again.

Taking inspiration from a PID controller, a modified RNN architecture is applied in a model reference
adaptive control framework to control an automotive engine in Ref. [35]. Although identification
and control are both performed by RNNs, the identification is performed offline and only the controller is
trained online. Therefore, direct application of this method to systems whose dynamics may be affected
by faults in unexpected ways is not possible.


17.8.3 Multiple Model as a Framework 


Even though a reconfigurable adaptive controller is a key element without which solutions for unknown
faults cannot be designed online, if used alone as an FTC architecture, it displays two major limitations.
The first is that a reconfigurable controller makes it impossible for any available
fault knowledge to be incorporated at design time. Although an ideal reconfigurable controller
will always reach a solution (provided one exists) for a given fault scenario, the amount of time it must be
allowed to learn the new dynamics and modify itself accordingly could be greatly reduced by the direct
application of a known solution. The second major limitation is caused by the known tradeoff between
adaptability and long-term memory. As a reconfigurable controller is optimized to deal with a broader
scope of faults with minimum reconfiguration time, previously configured controllers are forgotten and
the reconfiguration process has to be repeated even when returning to the healthy condition from an
intermittent fault scenario.

Multiple models architecture (MMA) [23,29] presents a framework in which projection-based methods
and online redesign can be synergistically integrated to provide the fast and specific response of the
first combined with the flexibility and robustness of the second. More specifically, in Refs. [9,36] it was
shown that implementing a reconfigurable controller in an MMA has the potential to overcome the cited
limitations for the tracking of complex nonlinear plants. Since then, MMA has been applied to FTC by
combining fault scenarios and their respective control solutions in model banks coordinated by a
supervisor. However, most publications so far are based on fixed model banks built offline and therefore are
incapable of improving the controller response on the recurrence of faults that were unexpected at
design time. In Ref. [37], a dynamic model bank (DMB) is used to allow the insertion of new plant
dynamics as they are identified online, but the use of a linear controller and the lack of a complete fault
detection and diagnosis (FDD) scheme significantly limit its applicability.

To better understand the MMA approach, its simplest implementation, GS, will first be introduced
and discussed. GS is a technique that aims to provide control over nonlinear systems without requiring
the design of nonlinear controllers. The first step in GS is to linearize the model about one or more
operating points. Then linear design methods are applied to the linearized model at each operating point in
order to arrive at a set of linear feedback control laws that perform satisfactorily when the closed-loop
system is operating near the respective operating points. The zone of the state space in which a controller
still performs satisfactorily is called its operating region. The final step is the actual GS, which is intended
to handle the nonlinear aspects of the design problem. The basic idea involves interpolating in some
way the linear control law designs at intermediate operating conditions. It is usual in GS applications to
choose a particular structure for the linear controllers (e.g., PID), so that its parameters (gains) are
modified (scheduled) according to the states of the closed-loop system.
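
The interpolation step of GS can be illustrated with a minimal sketch. It assumes a single scalar scheduling variable and PID gains that have already been designed at three operating points; the operating points, the gain values, and the function names are hypothetical and serve only to show how gains might be scheduled between designs.

```python
import numpy as np

# Hypothetical operating points (values of one scheduling variable) and the
# PID gains assumed to have been designed for the linearized model at each.
OPERATING_POINTS = np.array([0.0, 1.0, 2.0])
PID_GAINS = np.array([
    [2.0, 0.50, 0.10],  # Kp, Ki, Kd designed at operating point 0.0
    [1.5, 0.40, 0.08],  # Kp, Ki, Kd designed at operating point 1.0
    [1.0, 0.20, 0.05],  # Kp, Ki, Kd designed at operating point 2.0
])


def scheduled_gains(scheduling_variable):
    """Linearly interpolate each gain between neighboring operating points.

    Outside the covered range, np.interp holds the nearest design constant.
    """
    return np.array([
        np.interp(scheduling_variable, OPERATING_POINTS, PID_GAINS[:, k])
        for k in range(PID_GAINS.shape[1])
    ])


def pid_output(error, error_integral, error_derivative, scheduling_variable):
    """PID control law whose gains follow the current operating condition."""
    kp, ki, kd = scheduled_gains(scheduling_variable)
    return kp * error + ki * error_integral + kd * error_derivative
```

Linear interpolation is only one possible scheduling rule; nothing in this sketch guarantees stability between the design points.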

In addition to the evident simplicity of designing controllers for linear approximations
instead of global nonlinear models, GS also provides the potential to respond rapidly to changing
operating conditions, and its real-time computational burden is light [22]. However, since the design
process of GS in its original formulation is based only on local information from a limited set of operating
points, no global characteristic (stability, performance, robustness, etc.) can be guaranteed. In the same
way that a well-designed set of linear controllers does not necessarily result in even a globally stable
control law for the nonlinear system, reachable nonlinear systems may yield uncontrollable linearized
models, preventing GS from being applied at all.

Advanced MMAs make use of local nonlinear models to design their controllers, resulting in potentially
larger operating regions for each controller. Given enough information about the system, this property
allows the main components of the system dynamics to be represented by a finite set of nonlinear models,
making it possible to incorporate global stability, performance, and robustness requirements in the
design phase of multiple models. Model predictive control, feedback linearization, and sliding mode
[38] are examples of such methods. Another benefit of using nonlinear models and controllers is
that the total number of models can be reduced dramatically, making it feasible to
apply the MMA concept to systems with widely diversified complex dynamics.

Nevertheless, regardless of the linearity of the models used to generate the set of controllers, the
quality of the end result of an MMA approach is still largely affected by a wide range of
design choices concerning how many controllers to create, where to position them, and how to interpolate
the controllers designed at each operating point or region.

For a better understanding and comparison between different approaches, the parameter space
representation presented in Ref. [9] will be used. The parameter space (S) is an augmented version of the
state space representation that includes “states” of the environment that contain information of sensors
present in the plant used solely to extract fault information. Temperature, for example, can be considered
an environmental state if the model of the plant does not take it into account directly, but as the
temperature deviates from the nominal condition the dynamics of the plant are altered. In the examples
that follow, the parametric space is a bounded region that encompasses the physically achievable values
of each state.

For the sake of visualization, the following discussion will be held with examples using two-dimensional
parameter spaces. The conclusions, however, are not limited to this particular case, it being possible to
apply all of the discussed methods in higher dimensional spaces directly.

FIGURE 17.10 Performing multiple model control with sparsely distributed operating regions. O1-O3 are
operating regions around each operating point. The system is originally in the position of the parameter
space marked by the white star and follows the depicted trajectory.

Figure 17.10 shows a basic MMA setting where a set of controllers is devised for some specific operating
regions sparsely distributed in the parameter space. Each of the operating regions (O1, O2, and O3) is
generated around an operating point and limited by the range of the state space of the plant in which
the corresponding controller performs satisfactorily. In the dimensions of the
parameter space that do not represent states of the plant, the operating regions represent the robustness
of the controller.

If the plant is in a position in the parameter space that is close enough to an operating point to be
inside its operating region, it is reasonable to apply the respective precomputed control law. This is the
case for the original position (white star) of the trajectory shown in Figure 17.10. However, variations in
the set-point or the occurrence of faults may take the system to a point away from all operating points
that were considered offline (black star), and the question of what control law to use is raised. As a matter
of fact, since a precise description of the operating regions is often not available in practice, such a question
may arise even while the plant is still inside the respective operating region.

Perhaps the most intuitive way to generate control laws between operating points
is to set the parameters of the active controller to a mean of the parameters of the controllers
designed at each operating point, weighted by their geometrical distance to the present
position in the parameter space. The main criticism of this method is that it does not take into account the
nonlinear characteristic of the system that creates a heterogeneous parameter space. In Figure 17.10, for
example, since the plant finds itself closer to one of the operating points, weighting the sum by the
geometrical distance alone would result in a control law more similar to the one devised for that operating
point. However, if a strong nonlinearity existed between that operating point and the present system
position, the ideal control law may be more similar to those created for the other two operating points.
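
A minimal sketch of such a distance-weighted mean is given below. It assumes inverse-distance weights (the text does not fix the exact weighting function) and controllers that share the same parameter vector layout; the function name and arguments are hypothetical.

```python
import numpy as np


def distance_weighted_parameters(position, operating_points, controller_params,
                                 eps=1e-9):
    """Blend controller parameters with inverse-distance weights (assumed form).

    position          : present location in the parameter space, shape (d,)
    operating_points  : operating point locations, shape (n, d)
    controller_params : one parameter vector per operating point, shape (n, p)
    """
    position = np.asarray(position, dtype=float)
    operating_points = np.asarray(operating_points, dtype=float)
    controller_params = np.asarray(controller_params, dtype=float)

    distances = np.linalg.norm(operating_points - position, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] < eps:
        # The plant sits on an operating point: use its controller unchanged.
        return controller_params[nearest].copy()

    weights = 1.0 / distances
    weights /= weights.sum()
    return weights @ controller_params
```

Because the weights depend on geometry alone, this sketch inherits the limitation discussed above: it cannot account for a strong nonlinearity lying between the plant and its nearest operating point.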

Among the techniques that have been researched to overcome this limitation, some are of
special interest to this study as they were specifically designed for FTC applications. In Ref. [23], a set of
IF-THEN rules was used in a fuzzy logic framework to compare the present position in the parameter
space with the symptoms of known faults. The degree of similarity with each fault scenario was then
used to weigh the mean that adjusts the parameters of the controller. Assuming that knowledge is available
regarding the status of the system that makes it prone to develop each expected fault, in Ref. [5] this
approach was improved by applying the fuzzy algorithm only to the set of possible faults at a given position
in the parameter space.

A different approach was taken in Ref. [25] where the probability of occurrence of each expected fault 
was modeled in a finite-state Markov chain with known transition probabilities. With this information 
at hand, the mean of control parameters was weighted favorably to the most probable fault scenarios. 

Regardless of the weighting scheme chosen, it is still an approximation of the behavior of the system
outside the considered regions of operation and as such it is inevitably susceptible to nonlinearities
active outside those regions. One way to solve this deficiency is to generate closely connected models
by dividing the whole parameter space into evenly spaced operating regions as shown in Figure 17.11a.
The natural tradeoff of this method is that increased control performance tends to require controllers
designed for smaller, and therefore less complex, regions. This in turn causes the final number of models
needed to cover the whole space to grow, requiring extensive design work since a control law has to be
designed for each model. This relationship can be clearly seen in a series of simulation results reported
in Ref. [36]. If made small enough, each region can be represented by a linear model given by the
linearization of the nonlinear plant at the center of the operating region, making it possible to apply GS.

FIGURE 17.11 Closely connected multiple model implementations: (a) fixed size operating regions and (b)
plant-dynamics-dependent operating regions. In the figure, each rectangular section represents an operating
region. The system is originally in the position of the parameter space marked by the white star and follows
the depicted trajectories.

Since the whole parameter space is covered by previously designed controllers, a basic closely connected
MMA algorithm would be composed of only two steps: determine the present position of the plant in
the parameter space and apply the corresponding controller. However, even though each controller is
designed to provide a desirable behavior for the plant while inside its respective operating region, if no
special procedure is performed, switching directly from one control law to another may cause all kinds
of unwanted responses as the plant navigates from one operating region to another. In Ref. [9], a minimum
time (or number of iterations) was set for permanence inside an operating region before switching
takes place, creating in this way a time-based hysteresis in an effort to prevent oscillations between
adjacent operating regions. Another approach, requiring all controllers to possess the same structure, is
to create an area on the border of adjacent operating regions in which the parameters of both controllers
are combined, causing one to gradually change into the other. However, both methods are solely heuristic
solutions and no proof of their efficiency, let alone a deterministic way to configure their design
parameters, is available. A method that guarantees stability of systems when performing control switching
has been presented in Ref. [39]. The referenced paper describes a way to compute a pretransition sub-region
inside an operating region from which stability is assured when switching to another specific operating
region. In Ref. [36], an adaptive controller that operates in parallel with the MMA is used to assure
stability of the system during the transient behavior generated by switching controllers.
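
The time-based hysteresis idea can be sketched as follows. This is an illustrative reading of the minimum-permanence rule, not the implementation of Ref. [9]; the class name, the `min_dwell` parameter, and the region labels are assumptions.

```python
class DwellTimeSwitch:
    """Switch controllers only after the plant has remained in a new operating
    region for `min_dwell` consecutive iterations (time-based hysteresis)."""

    def __init__(self, initial_region, min_dwell):
        self.active = initial_region      # region whose controller is applied
        self.min_dwell = min_dwell
        self._candidate = initial_region  # region currently being "timed"
        self._count = 0                   # consecutive iterations in candidate

    def update(self, measured_region):
        """Feed the region the plant is in now; return the active region."""
        if measured_region == self.active:
            # Back in the active region: discard any pending switch.
            self._candidate, self._count = measured_region, 0
        elif measured_region == self._candidate:
            self._count += 1
        else:
            # A different region was entered: restart the permanence count.
            self._candidate, self._count = measured_region, 1

        if self._candidate != self.active and self._count >= self.min_dwell:
            self.active, self._count = self._candidate, 0
        return self.active
```

For example, with `min_dwell=3`, a brief excursion into a neighboring region leaves the active controller unchanged, while three consecutive iterations there trigger the switch.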

If complete information on the dynamics of the nonlinear system is available beforehand, it is possible
to divide the parameter space taking into account the sensitivity of different areas (as shown in
Figure 17.11b) and produce a combination of controllers with good performance based on a compact set
of operating regions. It is important to notice that, independent of the number and uniformity of the
regions, because no interpolation is fundamentally necessary, different control structures or strategies
can be used for each region. From the point of view of FTC applications that consider the occurrence of
unexpected faults, model weighting is not an attractive technique since there is no reason to assume that
a new fault dynamic will hold any relationship with those previously known. When closely connected
multiple models are considered, the quality of the response depends on the robustness of the design
of each controller and the way the control laws are switched from one to another. Although in this
formulation an active FTC is being performed for expected faults, no direct action can be defined for
unexpected dynamics. The requirement for robustness increases and, at the same time, since the areas
of sensitivity are no longer available at design time, a large number of evenly spaced operating regions
have to be created, greatly increasing the memory requirement and design effort.

It is therefore interesting to explore yet another way to apply the MMA concept in which controllers
are designed online as new operating regions are reached [36]. Since no information about the parameter
space is assumed to be available at design time, nonlinear online identification is required in order
to learn new operating regions (models) and recognize the ones for which a controller has already been
designed. In this way, unlike the previously discussed methods that adjust the controller based
on the position of the plant in the parameter space, the online building of models achieves the same
in an indirect manner through the identification error of the models designed so far. Therefore, if at a given
moment the identification error of every known model (contained in a dynamic database) is high, the
plant is considered to be in an unknown region of the parameter space, while if the error of one of the
models is low, it indicates that the plant is inside the previously designed operating region. What is
considered to be “high” or “low” depends on an identification threshold selected by the user. By reducing
this threshold, the operating region of each model shrinks, causing a greater number of models to
be generated. In this sense, a parallel can be drawn between the setting of the identification threshold
and the choice of how many fixed models to have in the closely connected operating regions approach.
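
The role of the identification threshold can be sketched with a small dynamic database. This is a simplified illustration under stated assumptions: models are plain callables mapping an input to a predicted output, and the identification error is a mean absolute one-step prediction error over a short window. The class and method names are hypothetical.

```python
class DynamicModelBank:
    """Hypothetical dynamic database of identification models."""

    def __init__(self, identification_threshold):
        self.identification_threshold = identification_threshold
        self.models = []  # each model: callable input -> predicted output

    def errors(self, inputs, outputs):
        """Mean absolute prediction error of every stored model on a window."""
        return [
            sum(abs(y - model(u)) for u, y in zip(inputs, outputs)) / len(outputs)
            for model in self.models
        ]

    def locate(self, inputs, outputs):
        """Index of the known model whose error is below the threshold,
        or None when the plant is in an unknown region."""
        errs = self.errors(inputs, outputs)
        if errs:
            best = min(range(len(errs)), key=errs.__getitem__)
            if errs[best] <= self.identification_threshold:
                return best
        return None

    def add(self, model):
        """Store a newly identified model and return its index."""
        self.models.append(model)
        return len(self.models) - 1
```

Lowering `identification_threshold` makes `locate` return `None` more often, so more models end up being added, which mirrors the shrinking of operating regions described above.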

It is interesting to notice that, due to the indirect measuring through the dynamics of the plant, the
operating regions now span the space of the identifier, not the parameter space. If a NN is used to
approximate the plant dynamics, for example, the operating regions span the dimensions defined by
its weights. Operating regions described in the parameter space have dimensions with direct
physical meaning since they come from sensor readings. Although this property can be highly desirable
in certain operations such as translating expert knowledge to the model database, changes in the
dynamics are not always directly linked to the position in the environmental space. For example, a high
temperature in a certain part of a system may not instantly cause a fault, but may increase the probability
of its occurrence. Since the identifier focuses on the change in the dynamics and not on the secondary
symptoms of faults, it does not suffer from this drawback. On the other hand, in order to extract
information from expert sources it is necessary to duplicate the described conditions in simulation so
that the identifier is able to produce a model in its own space.

As with the identification models, the control laws must also be devised online. A single control strategy
that modifies itself based on the identified models, such as approximate feedback linearization [28],
is a valid approach for plants whose dynamics do not present extreme nonlinearities. When this is not the
case, highly flexible nonlinear adaptive controllers [29] may be applicable.

If a new model is added to the database every time the identification threshold is exceeded, the
area of the parameter space to which the system is exposed will be filled with closely connected
models and therefore there is no need to use the same control structure for every operating region.
Particular solutions previously known to exist for particular regions of the parameter space can then
be directly introduced. For example, fuzzy logic can be used to extract expert knowledge on the solution
of a particular fault, while NNs are used to generate novel control laws to cope with unexpected
fault scenarios.

A sparsely connected model distribution can also be attained by the online MMA approach if a second
threshold to measure model dissimilarity is created. The dissimilarity threshold, always greater than
the identification one, indicates the regions in which the present dynamics of the plant are considered
to be different enough from all the models in the database to justify the addition of a new model. Such a
scheme was implemented in Ref. [37], where the parameters of the controllers for regions not covered by
the models in the database were adjusted by a mean of the known controllers weighted by the inverse
of the identification error of their respective models. In this way, the control laws for regions between
models hold more similarity to the ones devised for similar plant dynamics.
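
The two-threshold logic can be sketched as a single decision function. This is a hedged illustration and not the scheme of Ref. [37] itself: the function name, the returned action labels, and the assumption that all controllers share one parameter vector layout are introduced here for illustration only.

```python
import numpy as np


def blend_or_add(errors, controller_params, id_threshold, dissimilarity_threshold):
    """Decide how to obtain the control parameters from identification errors.

    Returns a pair (action, params) where action is one of
      "use_model" : a known model matches; use its controller as is
      "blend"     : between models; inverse-error weighted mean of controllers
      "add_model" : dynamics too dissimilar; a new model must be identified
    """
    if id_threshold < 0 or dissimilarity_threshold <= id_threshold:
        raise ValueError("require 0 <= id_threshold < dissimilarity_threshold")

    errors = np.asarray(errors, dtype=float)
    params = np.asarray(controller_params, dtype=float)
    best = int(np.argmin(errors))

    if errors[best] <= id_threshold:
        return "use_model", params[best]
    if errors[best] > dissimilarity_threshold:
        return "add_model", None

    # All errors exceed id_threshold >= 0 here, so the inverses are finite.
    weights = 1.0 / errors
    weights /= weights.sum()
    return "blend", weights @ params
```

With identification errors of 0.2 and 0.6, for instance, the first controller receives three times the weight of the second, so the blended law stays closer to the model that explains the present dynamics better.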

Apart from the above-mentioned concerns involving the transient behavior of the system when
switching is performed, the application of MMA to FTC harbors two other points that require careful
consideration. The first is that the task of linking either the present location in the
parameter space or the prediction errors of identification models to the occurrence of a particular
fault represents an FDI process and as such is vulnerable to all of the issues outlined for FDI. The second
point concerns FTC applications that require new models to be designed online in a continuously growing
database. In such a scenario, the nonlinear adaptive controller is required to be, at the same time,
quick to converge, highly flexible, and guaranteed to be stable, characteristics that often conflict
in practice.


17.9 Case Studies 


In order to demonstrate the capabilities of the identification network and provide a better understanding
of the fine interrelations between the supervisor and DHP controller, two numerical examples are
presented. In both examples, faults are simulated by instantly or gradually changing the model of the
plant. To give better insight into the challenge of each fault scenario, linear models of fixed order similar
to those employed in Ref. [37] are used here. This information, however, is not used in any way during
the design of the fault tolerant controller, which continues to treat the plant as possessing a generic
nonlinear model.


17.9.1 Identification Using an RNN


The goal of the following example is to display the capabilities of the single-layered recurrent network 
to perform the identification of linear difference systems. An input signal is supplied in the form of a 
fixed frequency sine wave that changes mean and amplitude only once during the simulation. Since in 
the final application the input to the plant generated by the actor network is not necessarily composed 
of a large range of frequencies, the input with a limited spectrum represents a challenging but possible 
scenario in practice. 

Four systems are presented in the sequence displayed in Table 17.2. The network is allowed 50 s for the
identification of the first model, 30 s for the second, and 20 s for the third. The fourth and final model is
unstable and the applied sinusoidal input steeply drives the output to positive infinity. A variable learning
rate with a maximum value of 0.004 is used.

In Figure 17.12, the performance of the identifier can be seen. The small learning rate applied generates
a slow initial reaction, but the identification signal remains close enough to the true plant output
throughout the simulation in spite of the changes in the range of input and plant dynamics from
model 1 to model 3. As the fourth dynamic causes the output of the plant to grow steadily at increasing
rates, it is not feasible for an identifier with a maximum learning rate to produce true identification
indefinitely. Still, the RNN-based identifier fulfils its goal until the output becomes 45 times larger
than the normal range of operation. In the complete scheme, this would allow the actor network more than
70 iterations to restructure itself in any way that would at least decrease the rate of divergence.


TABLE 17.2 Sequence of Changes in the Dynamics of the Plant Applied for the Identification Example

Start Time (ds)   Plant Dynamics
0                 y(t) = 1.810 y(t - 1) - 0.8187 y(t - 2) + 0.00566 u(t - 1) + 0.00566 u(t - 2)
500               y(t) = 1.810 y(t - 1) - 0.9000 y(t - 2) + 0.00566 u(t - 1) + 0.00566 u(t - 2)
800               y(t) = 1.810 y(t - 1) - 0.9048 y(t - 2) + 0.00242 u(t - 1) + 0.00234 u(t - 2)
1000              y(t) = 1.919 y(t - 1) - 0.9048 y(t - 2) - 0.00242 u(t - 1) + 0.00234 u(t - 2)

FIGURE 17.12 Results of the identification simulation. Plant signals are displayed in solid lines and the
identification network output in dashed lines.
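
The statement that only the fourth model is unstable can be checked from the autoregressive coefficients in Table 17.2. The sketch below is an illustrative check, assuming each model has the form y(t) = a1 y(t - 1) - a2 y(t - 2) + (input terms), so that its poles are the roots of z^2 - a1 z + a2; the dictionary and function names are hypothetical.

```python
import numpy as np

# (a1, a2) read from Table 17.2, with y(t) = a1*y(t-1) - a2*y(t-2) + input terms
AR_COEFFICIENTS = {
    1: (1.810, 0.8187),
    2: (1.810, 0.9000),
    3: (1.810, 0.9048),
    4: (1.919, 0.9048),
}


def pole_magnitudes(a1, a2):
    """Magnitudes of the roots of the characteristic polynomial z^2 - a1*z + a2."""
    return np.abs(np.roots([1.0, -a1, a2]))


def is_stable(a1, a2):
    """A discrete-time linear model is stable if all poles lie inside the unit circle."""
    return bool(np.all(pole_magnitudes(a1, a2) < 1.0))


STABILITY = {model: is_stable(*coeffs) for model, coeffs in AR_COEFFICIENTS.items()}
```

Under this reading, models 1 to 3 have all poles inside the unit circle, while model 4 has a real pole slightly above 1, which is consistent with the slowly diverging output described in the text.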


17.9.2 FTC Using a GDHP Controller


In this section, the GDHP adaptive critic controller is presented with the challenge of making a nonlinear
MIMO plant (17.61) follow the sinusoidal trajectories described by (17.62). As faults will be introduced later,
the original dynamics will also be referred to as the nominal dynamics. This plant, as well as the tracking
trajectories, was suggested in Ref. [9] as a simulation test bed for online nonlinear adaptive control. In the
original paper, results were only shown after the controller was allowed to adapt for 500,000 iterations.


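$$
\begin{aligned}
x_1(t+1) &= 0.9\,x_1(t)\sin(x_2(t)) + \left(2 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t)\left(1 + \sin(4x_3(t))\right) + \frac{x_3(t)}{1 + x_3^2(t)} \\
x_3(t+1) &= \left(3 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.61}
$$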
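
$$
\begin{aligned}
x_1(t) &= 0.5\sin\left(\frac{2\pi t}{50}\right) + 0.5\sin\left(\frac{2\pi t}{25}\right) \\
x_2(t) &= 0.25\sin\left(\frac{2\pi t}{50}\right) + 0.75\sin\left(\frac{2\pi t}{25}\right)
\end{aligned}
\tag{17.62}
$$
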

Note that although in this example we assume that all states are available (and therefore will be mapped),
we are only interested in tracking two of the states. Therefore, the matrices for the utility function and
identification goal are adjusted as shown in (17.63).


1 0 1 0 1 0 
S=|0 1 Of, a-[) ' Q=|0 1 0 (17.63) 
0 0 0 


For all the results shown in this section, the learning rates for the IANN, AcNN, and CrNN were set to
0.01, 0.001, and 0.04, respectively. For the CrNN training, the GDHP algorithm is set to combine
HDP and DHP with equal weights, i.e., with a weighting factor of 0.5. For both inputs and states, tap delay
lines of size 10 were used. The future horizon for the calculation of J(t) was set to approximately 50
iterations by having γ = 0.9.
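
As a rough check on the stated horizon, suppose (as an assumption made here for illustration; the precise definition of J(t) is the one given earlier in the chapter) that J(t) is a sum of future utilities discounted by the factor 0.9 raised to the number of steps ahead. The weight given to the utility 50 iterations ahead is then

$$
\gamma^{50} = 0.9^{50} \approx 5.2 \times 10^{-3},
$$

and, for a geometric discount, the terms beyond 50 iterations carry the same fraction of the total weight:

$$
\frac{\sum_{k=50}^{\infty} \gamma^{k}}{\sum_{k=0}^{\infty} \gamma^{k}} = \gamma^{50} \approx 0.5\%.
$$

Under this assumption, utilities more than about 50 iterations ahead contribute negligibly, which is consistent with the approximate horizon quoted above.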

Figure 17.13 shows a performance comparison of the GDHP algorithm running with
20 hidden neurons and with 50 hidden neurons in each of the three NNs. The results indicate that the
GDHP algorithm managed to adapt the weights of the NNs so as to best follow the trajectories, given the
power granted by the number of hidden neurons available.

FIGURE 17.13 Results of the application of GDHP with different number of hidden neurons for the online
adaptation of the nominal plant. The desired trajectories are plotted in dash-dotted lines, while the actual
plant outputs are in solid lines.

This first experiment demonstrates that GDHP is capable of achieving excellent results, provided that
enough hidden neurons are available and that enough time is provided. It is also important to point out
that the result shown in Figure 17.13 was achieved after 42,000 iterations, more than 10 times faster than
in the original paper, which used a classical neural control design.

In the next experiment, a fault is introduced at iteration 32,000 by changing the nominal dynamics
of the first state of the plant to the following:


$$
x_1(t+1) = 0.5\,x_1(t)\sin(x_2(t)) + \left(4 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t)
\tag{17.64}
$$


As can be seen in Figure 17.14, after a short transient in which large spikes are generated, the GDHP
controller manages to generate a control sequence that once again tracks the desired trajectories closely.
Figure 17.15 gives an indirect idea of the amount of reconfiguration performed by depicting how different
the control inputs for the nominal and fault scenarios had to be to achieve the same tracking.
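
For readers who wish to reproduce the test bed, the plant update can be written as a single function. The sketch below is an illustrative transcription of Equations 17.61 and 17.64 as read here (the coefficients 0.9 and 2 for the nominal case, 0.5 and 4 for the fault); the function name and the `fault` flag are hypothetical and not part of the original formulation.

```python
import math


def plant_step(x, u, fault=False):
    """One iteration of the three-state, two-input test plant.

    x : (x1, x2, x3) current states
    u : (u1, u2) current inputs
    fault : when True, the first state follows the faulty dynamics
            (assumed reading of Equation 17.64) instead of the nominal ones.
    """
    x1, x2, x3 = x
    u1, u2 = u
    # Only the first state equation differs between nominal and fault cases.
    a, b = (0.5, 4.0) if fault else (0.9, 2.0)

    x1_next = (a * x1 * math.sin(x2)
               + (b + 1.5 * x1 * u1 / (1.0 + x1**2 * u1**2)) * u1
               + (x1 + 2.0 * x1 / (1.0 + x1**2)) * u2)
    x2_next = x3 * (1.0 + math.sin(4.0 * x3)) + x3 / (1.0 + x3**2)
    x3_next = (3.0 + math.sin(2.0 * x1)) * u2
    return (x1_next, x2_next, x3_next)
```

Calling `plant_step` in a loop and setting `fault=True` from a chosen iteration onward reproduces the kind of abrupt change in dynamics used in this experiment.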

As pointed out previously, the future horizon for J(t) is adjusted directly through γ. This fact allows for a
very simple way to compare the performance of the GDHP adaptive control algorithm with the classical
neural control approach, which uses only the IANN and AcNN discussed. As can be seen in Equation 17.32,
setting γ = 0 reduces the training of the AcNN to a minimization of U(t) only, while Equations 17.52
through 17.54 show that the CrNN is reduced to an estimator of the utility function and its derivatives
only.

Without changing any other parameter, the experiment in which the fault is introduced at iteration
32,000 was run again for γ = 0. The sum of the absolute tracking error over two complete periods (1000
iterations) was used as a performance indicator. The evolution of this indicator during
training for γ = 0.9 and γ = 0 is compared in Figure 17.16. The faster convergence of the GDHP
controller clearly demonstrates the advantage of possessing a functional CrNN over the classical approach.
As a matter of fact, the classical approach with the same parameters could not even maintain stability
of the plant after the fault is introduced. Although it is reasonable to argue that different parameters
(in particular, different learning rates) might have produced better results for the classical approach, the
presented result stands as an indication that GDHP might be a more stable (i.e., less affected by design
parameters) NN control paradigm.
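
The performance indicator described above can be computed as in the following sketch, which assumes that the desired and actual trajectories are stored as arrays with one row per iteration and one column per tracked state. The function name and the non-overlapping window choice are assumptions made for illustration.

```python
import numpy as np


def windowed_abs_error(desired, actual, window=1000):
    """Sum of absolute tracking error over consecutive windows of iterations.

    desired, actual : arrays of shape (n_iterations,) or (n_iterations, n_states)
    window          : iterations per window (1000 in the experiment described)
    Returns one value per complete window; a trailing partial window is dropped.
    """
    abs_err = np.abs(np.asarray(desired, dtype=float) - np.asarray(actual, dtype=float))
    if abs_err.ndim == 2:
        abs_err = abs_err.sum(axis=1)  # add the errors of all tracked states
    n_windows = len(abs_err) // window
    return abs_err[: n_windows * window].reshape(n_windows, window).sum(axis=1)
```

Plotting the returned values against the window index gives a curve of the kind compared in Figure 17.16.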


FIGURE 17.14 Performance of the GDHP controller as a fault changes the plant dynamic at iteration 32,000.
A plot of the general development (left) and the tracking performance after 10,000 iterations following the
fault occurrence (right).

FIGURE 17.15 Input sequences developed by the GDHP controller to track the desired states under nominal
(left) and fault (right) conditions.

FIGURE 17.16 Tracking performance comparison between GDHP adaptive critic and classical NN adaptive
control design.


Finally, to demonstrate the capabilities of the GDHP to deal with a plant that varies its dynamics 
under several different fault scenarios, the following dynamics were presented: 


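Nominal: iterations 1-30,000

$$
\begin{aligned}
x_1(t+1) &= 0.9\,x_1(t)\sin(x_2(t)) + \left(2 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t)\left(1 + \sin(4x_3(t))\right) + \frac{x_3(t)}{1 + x_3^2(t)} \\
x_3(t+1) &= \left(3 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.65}
$$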
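
Fault 1: iterations 30,001-40,000

$$
\begin{aligned}
x_1(t+1) &= 0.5\,x_1(t)\sin(x_2(t)) + \left(4 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t)\left(1 + \sin(4x_3(t))\right) + \frac{x_3(t)}{1 + x_3^2(t)} \\
x_3(t+1) &= \left(3 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.66}
$$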
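
Fault 2: iterations 40,001-50,000

$$
\begin{aligned}
x_1(t+1) &= 0.5\,x_1(t)\sin(x_2(t)) + 4u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t)\left(1 + \sin(4x_3(t))\right) + \frac{x_3(t)}{1 + x_3^2(t)} \\
x_3(t+1) &= \left(5 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.67}
$$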
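
Fault 3: iterations 50,001-60,000

$$
\begin{aligned}
x_1(t+1) &= 0.5\,x_1(t)\sin(x_2(t)) + \left(4 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t) \\
x_3(t+1) &= \left(3 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.68}
$$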
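
Return to Nominal: iterations 60,001-70,000

$$
\begin{aligned}
x_1(t+1) &= 0.9\,x_1(t)\sin(x_2(t)) + \left(2 + \frac{1.5\,x_1(t)u_1(t)}{1 + x_1^2(t)u_1^2(t)}\right)u_1(t) + \left(x_1(t) + \frac{2x_1(t)}{1 + x_1^2(t)}\right)u_2(t) \\
x_2(t+1) &= x_3(t)\left(1 + \sin(4x_3(t))\right) + \frac{x_3(t)}{1 + x_3^2(t)} \\
x_3(t+1) &= \left(3 + \sin(2x_1(t))\right)u_2(t)
\end{aligned}
\tag{17.69}
$$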


As depicted in Figures 17.17 and 17.18, the GDHP adaptive controller was capable of devising new 
nonlinear controllers as it adapted online so as to keep the trajectory as close as possible to the desired one 
under different fault scenarios that modified the plant dynamics several times. 
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FIGURE 17.17 Trajectory tracking results as the dynamics are changed several times during the experiment. 




FIGURE 17.18 Different control input sequences developed as each fault scenario was presented. 


17.10 Concluding Remarks 


In this chapter, a particular implementation of GDHP adaptive critic controller architecture was intro- 
duced as a solution to the FTC problem that composed the initial motivation. A series of computer 
simulations provided favorable results indicating that the GDHP architecture might hold the potential 
to be the superior ADC architecture. A complete and detailed mathematical description of the proposed 
GDHP algorithm used in the experiments was introduced in a format that should lead to easy imple- 
mentation in any matrix-oriented compiler. Finally, the applicability of the GDHP controller to the 
FTC problem was demonstrated in a comprehensive example where the plant dynamics were modified 
several times by the occurrence of different faults. 
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18.1 Introduction 


The self-organizing map (SOM) is a neural network paradigm for exploratory data analysis. The idea of 
the SOM was originally motivated by localized regions of activities in the human cortex, where similar 
regions react to similar stimuli. This model stems from Kohonen’s work [1] and builds upon earlier work 
of Willshaw and von der Malsburg [2]. As a data analysis tool, the SOM can be used at the same time 
both to reduce the amount of data by clustering and for projecting the data nonlinearly onto a lower- 
dimensional display [3]. Because of its benefits, the SOM has been used in a wide variety of scientific and 
industrial applications, such as image recognition, signal processing, and natural language processing. 
In the research community, it has received significant attention in the context of clustering, data min- 
ing, topology preserving, vector projection, and data visualization. 

The SOM is equipped with an unsupervised and competitive learning algorithm. It consists of an 
array of neurons placed in a regular, usually two-dimensional (2D) grid. Each neuron is associated with 
a weight vector (or prototype vector). Similar to other competitive networks, the learning rule is based 
on weight adaptations. In the original algorithm of SOM, only one neuron (winner) at a time is activated 
corresponding to each input. The presentation of each input pattern results in a localized region of 
activity in the SOM network. During the learning process, a sufficient number of different realizations 
of the input patterns are fed to the neurons so that the neurons become tuned to various input patterns 
in an orderly fashion. The principal goal of the SOM is to adaptively transform an incoming pattern of 
arbitrary dimension into the low-dimensional SOM grid. The locations of the responses in the array 
tend to become ordered in the learning process as if some meaningful nonlinear coordinate system for 
the different input features were being created over the network. This projection can be visualized in 
numerous ways in order to reveal the characteristics of the underlying input data or to analyze the 
quality of the obtained mapping [4]. 


18.1.1 Structure 


The neurons in a SOM are usually placed in a regularly spaced one-, two-, or higher dimensional grid. 
The 2D grid is most commonly used because it provides more information than the one-dimensional 
(1D) and is less problematic than the higher dimensional ones. The positions of the neurons in the grid 
are fixed; they do not move during the training phase of the SOM. 

The neurons are connected to adjacent neurons by a neighborhood relation, which dictates the 
structure or topology of the map. The neurons most often are connected to each other via a rectangu- 
lar or hexagonal grid structure. The grid structures are illustrated in Figure 18.1, where neurons are 
marked with black dots. Each neuron has neighborhoods of increasing diameter surrounding it. The 
neighborhood size controls the smoothness and generalization of the mapping. Neighborhoods of dif- 
ferent sizes in both topologies are also illustrated in Figure 18.1. Neighborhood 1, the neighborhood 
of diameter 1, includes the center neuron itself and its immediate neighbors. The neighborhood of 
diameter 2 includes the neighborhood 1 neurons and their immediate neighbors. The map topology is 
usually planar but toroidal topologies [5] have also been used. Figure 18.2 illustrates these two types 
of topologies. 


18.1.2 Initialization 


In the basic SOM algorithm, the layout and number of neurons are determined before training. They 
are fixed from the beginning. The number of neurons determines the resolution of the resulting 
map. A sufficiently high number of neurons should be chosen to obtain a map with decent resolution. 
Yet, this number should not be too high, as the computational complexity grows quadratically with the 
number of neurons [6]. 


FIGURE 18.2 Two types of SOM topologies: (a) planar topology and (b) toroidal topology. 
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Each neuron in the SOM is associated with an n-dimensional weight vector 


$$w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$$


where 
n is the dimension of the input vectors 
T denotes the matrix transpose 


The weight vector is often referred to as the prototype vector. In this chapter, the terms weight vector and 
prototype vector are used interchangeably. Before the training phase, initial values are assigned to the 
weight vectors. Three types of network initializations are proposed by Kohonen [3]: 


1. Random initialization, where simply random values are given to weight vectors. This is the case if 
little is known about the input data at the time of the initialization. 

2. Initialization using initial samples, which has the advantage that the initial locations of the weight 
vectors lie in the same part of the input space as the data points. 

3. Linear initialization, where the weight vectors are initialized to lie in the linear subspace spanned 
by the two principal eigenvectors of the input data, that is, the eigenvectors with the two largest 
eigenvalues. This helps to stretch the SOM to the orientation in which the input data set has the 
most significant amount of information. 
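
The three initialization schemes listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not Kohonen's reference procedure: the function names, the use of the per-feature data range for random initialization, and the use of the sample covariance matrix (with axes scaled by the square root of the eigenvalues) for linear initialization are all assumptions made here.

```python
import numpy as np

def init_random(rows, cols, data, rng):
    """Random initialization; values are drawn inside the per-feature data range (assumed choice)."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    return rng.uniform(lo, hi, size=(rows * cols, data.shape[1]))

def init_from_samples(rows, cols, data, rng):
    """Initialization from samples; requires at least rows * cols data points."""
    idx = rng.choice(len(data), size=rows * cols, replace=False)
    return data[idx].copy()

def init_linear(rows, cols, data):
    """Linear initialization on the plane spanned by the two principal eigenvectors."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)            # (n, n) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:2]                # indices of the two largest
    axes = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
    u, v = np.meshgrid(np.linspace(-1, 1, rows), np.linspace(-1, 1, cols), indexing="ij")
    coords = np.stack([u.ravel(), v.ravel()], axis=1)  # grid coordinates in [-1, 1]^2
    return mean + coords @ axes.T                      # (rows * cols, n) weight vectors
```

Each function returns one weight vector per row, ordered row by row over the grid, so the result can be passed directly to a training loop that indexes neurons by a flat index.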


18.1.3 Training 


The SOM is an unsupervised neural network, which means the training of a SOM is completely data 
driven. No external supervisor is available to provide target outputs. The SOM learns only from the 
input vectors through repetitive adaptations of the weight vectors of the neurons. 

The training of the SOM is an iterative process. At each time step, one input vector x is drawn ran- 
domly from the input data set and presented to the network. The training consists of two essential steps: 


1. Winner selection 
This step is often called competition. For each input pattern, a similarity measure is calculated 
between it and all the weight vectors of the map. The neuron with the greatest similarity with 
the input vector will be chosen as the winning neuron, also called the best-match unit (BMU). 
Usually the similarity is defined by a distance measure, typically Euclidean distance. Therefore 
the winner, denoted as c, is the neuron whose weight vector is the closest to the data sample in the 
input space. This can be defined mathematically as the neuron for which 


$$c = \arg\min_i \{\|x - w_i\|\} \tag{18.1}$$


2. Updating weight vectors 
After the winner is determined, the winning unit and its neighbors are adjusted by modifying 
their weight vectors toward the current input according to the learning rule formulated as 


$$w_i(t+1) = w_i(t) + \alpha(t)\,h_{ci}(t)\,[x(t) - w_i(t)] \tag{18.2}$$


where 
$x(t)$ is the input vector randomly drawn from the input set at time t 
$\alpha(t)$ is the learning rate function 
$h_{ci}(t)$ is the neighborhood function centered on the winner at time t 


This adaptation rule of the weights is closely related to k-means clustering. The weight vector of each 
neuron represents a cluster center. As in k-means, the weight of the best matching neuron (cluster center) 
is updated in a small step in the direction of the input vector x. However, unlike k-means, the winner and the 
neurons surrounding it are updated instead of the winner alone. The size of the surrounding region is 
specified by $h_{ci}(t)$, which is a non-increasing function of time and of the distance of neuron i from the winner c. 
As a result of the update rule, the neuron whose weight vector is the closest to the input vector is updated to 
be even closer. Consequently, the winning unit is more likely to win the competition the next time a similar 
input sample is presented, while less likely to win when a very different input sample is presented. As more 
input samples are presented to the network, the SOM gradually learns to recognize groups of similar input 
patterns in such a way that neurons physically close together on the map respond to similar input vectors. 
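
The two training steps, winner selection and weight update, can be illustrated with a minimal NumPy sketch of a single iteration. The function name `som_step`, the flat array layout, and the choice of a Gaussian neighborhood (introduced later in Section 18.1.5) are assumptions made for illustration only.

```python
import numpy as np

def som_step(weights, grid, x, alpha, sigma):
    """One training iteration.

    weights: (k, n) float array of weight vectors
    grid:    (k, 2) array of fixed neuron positions on the map
    x:       (n,) input vector
    Returns the winner index and the updated weights.
    """
    # Step 1, winner selection: neuron whose weight vector is closest to x.
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Neighborhood value of every neuron relative to the winner (Gaussian, assumed).
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = np.exp(-d2 / (2.0 * sigma ** 2))
    # Step 2, update: move each weight toward x in proportion to alpha * h.
    new_weights = weights + alpha * h[:, None] * (x - weights)
    return c, new_weights
```

Calling this function repeatedly with randomly drawn inputs and slowly shrinking `alpha` and `sigma` gives the basic training loop described in this section.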


18.1.4 Analysis of the Updating Rule 


The update rule in Equation 18.2 can be rewritten as 

$$w_i(t+1) = [1 - \alpha(t)h_{ci}(t)]\,w_i(t) + \alpha(t)h_{ci}(t)\,x(t) \tag{18.3}$$


This equation characterizes the influence of data samples during training and directly shows how the 
parameters $\alpha(t)$ and $h_{ci}(t)$ affect the motion of $w_i$. Every time a data sample $x(t)$ is presented to the 
network, the value of $x(t)$, scaled down by $\alpha(t)h_{ci}(t)$, is superimposed on $w_i$, and all previous values 
$x(t')$, $t' = 0, 1, \ldots, t-1$, are scaled down by the factor $[1 - \alpha(t)h_{ci}(t)]$, which we assume is less than 1. 
The contribution of the data samples can be shown more clearly by rewriting Equation 18.3 into a 
non-iterative form. 

Given $w_i(0)$ as the initial condition, Equation 18.3 can be transformed into the following form by 
iteratively substituting $w_i(t')$ with $w_i(t'-1)$, $t' = t, t-1, \ldots, 1$: 


$$w_i(t+1) = A(t+1)\,w_i(0) + \sum_{n=1}^{t} B(t+1, n)\,x(n) \tag{18.4}$$


The coefficient $A(t)$ describes the effect of the initial weight value on $w_i(t)$, and $B(t,n)$ describes the effect 
of the data point presented at time n on $w_i(t)$. Both $A(t)$ and $B(t,n)$ are functions of $\alpha(t)h_{ci}(t)$ and decrease 
with t. Equation 18.4 shows that $w_i(t+1)$, the weight vector at time t + 1, depends on a weighted sum of 
the initial condition and every data point presented to the network. $w_i(t+1)$ can therefore be considered 
as a “memory” of all the values of $x(t')$, $t' = 0, 1, \ldots, t$. As the weight function $B(t,n)$ is a function 
of $\alpha(t)$ and $h_{ci}(t)$, the influence of a training sample on the final weight vector depends on the specific 
learning rate and neighborhood function used during the self-organizing process. 
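
The non-iterative form can be checked numerically. The sketch below is an illustration under stated assumptions rather than the chapter's own derivation: it writes g(k) for the product of learning rate and neighborhood value at step k, uses zero-based indexing k = 0, ..., t (which may differ from the indexing of Equation 18.4), and obtains the coefficients by directly unrolling the recursion, giving A as the product of all factors 1 - g(k) and the coefficient of x(n) as g(n) times the product of 1 - g(k) over k > n.

```python
import numpy as np

def unrolled_coefficients(g):
    """Coefficients of the unrolled update w <- (1 - g[k]) * w + g[k] * x[k].

    g: sequence of g(k) = alpha(k) * h_ci(k) for k = 0..t (assumed indexing).
    Returns A (weight of the initial value) and B (array, weight of each x[n]).
    """
    g = np.asarray(g, dtype=float)
    A = np.prod(1.0 - g)
    # np.prod of an empty slice is 1, so the most recent sample keeps weight g[t].
    B = np.array([g[n] * np.prod(1.0 - g[n + 1:]) for n in range(len(g))])
    return A, B
```

Because every step forms a convex combination, A and the entries of B sum to 1; older samples receive smaller weights whenever the later factors 1 - g(k) are below 1, which is the "memory" behavior described above.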


18.1.5 Neighborhood Function 


The neighborhood function is a non-increasing function of time and of the distance of unit i from the 
winner neuron c. The form of the neighborhood function determines the rate of change around the 
winner neuron. The simplest neighborhood function is the bubble function, which is constant over 
the defined neighborhood of the winner unit and zero elsewhere. Using the bubble neighborhood 
function, every neuron in the neighborhood is updated by the same proportion of the difference between 
the unit and the presented sample vector. 

Another widely applied, smooth neighborhood function is the Gaussian neighborhood function 


$$h_{ci}(t) = \exp\left(-\frac{\|r_c - r_i\|^2}{2\sigma^2(t)}\right) \tag{18.5}$$


where 
$\sigma(t)$ is the width of the Gaussian kernel 
$\|r_c - r_i\|^2$ is the squared distance between the winner c and the neuron i, with $r_c$ and $r_i$ representing the 2D 
positions of neurons c and i on the SOM grid 

Usually, the radius of the neighborhood is large at first and decreases during the training. One commonly 
used form of $\sigma(t)$ is given by 


$$\sigma(t) = \sigma(0)\left(\frac{\sigma(f)}{\sigma(0)}\right)^{t/t_{\max}} \tag{18.6}$$


where 
$\sigma(0)$ is the initial neighborhood radius 
$\sigma(f)$ is the final neighborhood radius 
$t_{\max}$ is the number of training iterations 


Therefore, $\sigma(t)$ is a monotonically decreasing function of time. The decreasing neighborhood radius 
ensures that the global order is obtained at the beginning, whereas toward the end, the local corrections 
of the weight vectors of the map will be more specific. 
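
As a rough illustration of the two neighborhood functions and the shrinking radius, the following NumPy sketch assumes the standard Gaussian form exp(-d²/(2σ²)) and a geometric decay of the radius from an initial to a final value. The function names and the flat `(k, 2)` grid layout are hypothetical choices made here.

```python
import numpy as np

def sigma_schedule(t, t_max, sigma_0, sigma_f):
    """Radius decaying geometrically from sigma_0 (t = 0) to sigma_f (t = t_max)."""
    return sigma_0 * (sigma_f / sigma_0) ** (t / t_max)

def gaussian_neighborhood(grid, c, sigma):
    """Gaussian neighborhood values of all neurons around winner index c."""
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)   # squared grid distances to the winner
    return np.exp(-d2 / (2.0 * sigma ** 2))

def bubble_neighborhood(grid, c, radius):
    """Bubble neighborhood: 1 inside the given grid radius of the winner, 0 elsewhere."""
    d = np.linalg.norm(grid - grid[c], axis=1)
    return (d <= radius).astype(float)
```

The bubble function updates all neurons inside the radius equally, whereas the Gaussian function weights the update smoothly by distance, so neurons far from the winner still move, but only slightly.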


18.1.6 Learning Rate 


The learning rate $\alpha(t)$ is a function decreasing with time. It can be linear, exponential, or inversely 
proportional to time. The linear learning rate function can be defined as 


$$\alpha(t) = \alpha(0)\,\frac{t_{\max} - t}{t_{\max}} \tag{18.7}$$

where $\alpha(0)$ is the initial learning rate. A commonly used exponentially decreasing function is given by 


$$\alpha(t) = \alpha(0)\left(\frac{\alpha(f)}{\alpha(0)}\right)^{t/t_{\max}} \tag{18.8}$$


where $\alpha(f)$ is the final learning rate. A function inversely proportional to time is given with the form 


$$\alpha(t) = \frac{1}{m(t-1)+1}, \qquad m = \frac{1 - N/t_{\max}}{N - N/t_{\max}} \tag{18.9}$$

where N is the total number of neurons. Using the learning rate function in Equation 18.9 ensures that 
earlier and later input samples have approximately equal effects on the training result. 
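
The three schedules can be compared with a short sketch. It assumes the forms as read here: a linear decay to zero at t_max, a geometric decay from the initial to the final rate, and an inverse-time rate 1/(m(t - 1) + 1) with m = (1 - N/t_max)/(N - N/t_max). The function names are hypothetical.

```python
def alpha_linear(t, t_max, alpha_0):
    """Linear decay from alpha_0 at t = 0 to zero at t = t_max."""
    return alpha_0 * (t_max - t) / t_max

def alpha_exponential(t, t_max, alpha_0, alpha_f):
    """Geometric decay from alpha_0 at t = 0 to alpha_f at t = t_max."""
    return alpha_0 * (alpha_f / alpha_0) ** (t / t_max)

def alpha_inverse(t, t_max, n_neurons):
    """Inverse-time decay; t is counted from 1, n_neurons is the number of map units."""
    m = (1.0 - n_neurons / t_max) / (n_neurons - n_neurons / t_max)
    return 1.0 / (m * (t - 1) + 1.0)
```

With these definitions the inverse-time rate starts at 1 for t = 1 and ends at N/t_max for t = t_max, so its final value depends on the map size relative to the training length.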

The learning rate and the neighborhood function together determine which neurons are allowed to 
learn and by how much. These two parameters are usually altered during training through 
two phases. In the first phase, namely the ordering phase, a relatively large initial learning rate and 
neighborhood radius are used. The parameters keep decreasing with time. During this phase, a comparatively 
large number of weight vectors are updated and they move in big steps toward the input samples. 
In the second phase, the fine-tuning phase, both parameters start with small values from the beginning. 
They continue to decrease, but very slowly. The number of iterations for the second phase should be much 
larger than that in the first phase, as the tuning usually takes much longer [7]. 

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18.2 Dynamic SOM Models 


In spite of the widespread use of the SOM, some shortcomings have been noted, which are related to 
the static architecture of the basic SOM model. First of all, the number of neurons and the layout of the 
neurons (i.e., the topology) have to be determined before training. The need for predetermining a fixed 
network structure is a significant limitation on the final mapping [8-11]. To address the issue of static 
SOM architecture, several variations based on the classical SOM have been developed recently. These 
dynamic SOM models usually employ an incremental growing architecture to cope with the lack of 
prior knowledge about the number of map units. Some of the models are summarized in this section. 


18.2.1 Growing Cell Structure 


One of the first models of this kind is the growing cell structures (GCS) [8]. In the GCS, the basic 2D grid 
of the SOM is replaced by a network of nodes whose basic building blocks are triangles. Starting with a 
triangle structure of three nodes, the algorithm both adds new nodes to and removes existing nodes from 
the network during the training process. The connections between nodes are adjusted in order to maintain 
the triangular connectivity. A local error measure is used to decide the position to insert a new node, which 
is usually between the node with the highest accumulated error and its most distant neighbor. The algo- 
rithm results in a network graph structure consisting of a set of nodes and the connections between them. 


18.2.2 Growing Neural Gas 


In addition to the GCS, Fritzke has also proposed the growing neural gas (GNG) [9] and the growing 
grid (GG) [10]. The GNG algorithm combines the GCS and the Neural Gas algorithm [12]. It starts with 
two nodes at random positions, and as in GCS, new nodes are inserted successively to support the nodes 
with high accumulated error. Unlike the GCS, the GNG structure is not constrained. The nodes are 
connected by edges with a certain age. Once the age of an edge exceeds a threshold, it will be deleted. 
After a fixed number of iterations, a new node is added between the node with the highest accumulated 
error and the one with maximum accumulated error among all its neighbors. As an alternative form of 
growing network, the GG starts with 2 x 2 nodes, taking advantage of a rectangular structured map. 
The model adds rows and columns of neurons during the training process, and therefore is able to auto- 
matically determine the height/width ratio suitable for the data structure. The heuristics used to add 
and remove nodes and connections are the same as those used in GCS. 
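
The GNG node insertion described above (find the node with the highest accumulated error, then its neighbor with the highest error, and place a new node between them) can be sketched as follows. This is a simplified illustration only: placing the new node at the midpoint, rewiring the edge through it, and giving it a placeholder error of zero are assumptions made here, and the error bookkeeping of the full algorithm in Ref. [9] is omitted.

```python
import numpy as np

def insert_node(weights, errors, edges):
    """Insert one node between the worst node and its worst neighbor.

    weights: (k, n) array; errors: (k,) array of accumulated errors;
    edges: set of frozenset({a, b}) pairs. The worst node must have a neighbor.
    """
    q = int(np.argmax(errors))                                  # highest accumulated error
    neighbors = [next(iter(e - {q})) for e in edges if q in e]
    f = max(neighbors, key=lambda j: errors[j])                 # worst neighbor of q
    r = len(weights)                                            # index of the new node
    new_w = 0.5 * (weights[q] + weights[f])                     # midpoint (assumed placement)
    weights = np.vstack([weights, new_w])
    errors = np.append(errors, 0.0)                             # placeholder error value
    # Replace the edge q-f by the two edges q-r and r-f.
    edges = (edges - {frozenset((q, f))}) | {frozenset((q, r)), frozenset((r, f))}
    return weights, errors, edges, r
```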


18.2.3 Incremental Grid Growing 


Another approach is the incremental grid growing (IGG) [13]. Starting from a small number of initial 
nodes, the IGG generates new nodes only at the boundary of the map. This guarantees that the IGG net- 
work will always maintain a 2D structure, which results in easy visualization. Another feature of IGG is 
that connections between neighboring map units may be added and removed according to a threshold 
value of the inter-unit weight differences. This may result in several disconnected sub-networks, which 
represent different clusters of input patterns. The growing self-organizing map (GSOM) [11], in a spirit 
similar to IGG, introduces a spread factor to control the growing process of the map. 


18.2.4 Other Growing Structure Models 


Other modified models have also been proposed, including the plastic self-organizing maps (PSOM) 
[14] and the grow when required (GWR) [15]. Figure 18.3 shows the simulation results of the original 
SOM and some of the dynamic models discussed above, which are given in Ref. [16]. The simulation 
results are generated using 40,000 input signals from a probability distribution that is uniform in the 
shaded area. The growing versions of the SOM aim at achieving an equal distribution of the input 
patterns across the map by adding new nodes near the nodes that represent a disproportionately high 
number of input data. 


FIGURE 18.3 Simulation results of different models after 40,000 adaptation steps: (a) the SOM, (b) GCS, 
(c) GNG, and (d) GG. The distribution is uniform in the shaded area. Map units are denoted by circles. 


18.2.5 Hierarchical Models 


Besides the limitation of the fixed structure, another deficiency of the classic SOM is its inability 
to capture the hierarchical structure commonly present in real-world data. The structural 
complexity of such data sets is usually lost during the mapping process by means of a single, low- 
dimensional map. In order to handle a data set with hierarchical relationships, hierarchical models 
should be used. These models try to organize data at different layers by displaying a representation 
of the entire data set at a top level and allowing the lower levels to reveal the internal structure 
of each cluster found in the higher-level representation, where such information might not be so 
apparent [17]. 

The hierarchical feature map [18] uses a hierarchical setup of multiple layers, where each layer is 
composed of a number of independent SOMs. Starting with one initial SOM at the top layer, a separate 
SOM is added to the next layer of the hierarchy for every unit in the current layer. Each map is trained 
with only a portion of the input data that is mapped onto the respective unit in the higher layer map. 
The amount of training data for a particular SOM is reduced as the hierarchy is traversed downward. As 
a result, the hierarchical feature map requires a substantially shorter training time than the basic SOM 
for the same data set. Moreover, it may be used to produce fairly isolated, or disjoint, clusters of the input 
data, while the basic SOM is incapable of performing the same [19]. 

Another hierarchical model, the hierarchical self-organizing map (HSOM) [20], focuses on speed- 
ing up the computation during winner selection by using a pyramidal organization of maps. However, 
like the hierarchical feature map, while representing the data in a hierarchical way this model does not 
provide a hierarchical decomposition of the input space. 

As an extension to the Growing Grid and the hierarchical SOM models, the growing hierarchical 
self-organizing map (GHSOM) [21] builds a hierarchy of multiple layers, where each layer consists of 
several independent growing SOMs. Starting from a top-level SOM, each map grows incrementally to 
represent data at a certain level of detail in a manner similar to the GG. In GHSOM, the level of detail is 
measured in terms of the overall quantization error. For every map unit in a level, a new SOM might be 
added to a subsequent layer if this unit represents input data that are too diverse and thus more details 
are desirable for the respective data. 
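
To make the notion of quantization error concrete, the sketch below computes a per-unit mean quantization error, taken here to be the average distance between a unit's prototype vector and the input vectors mapped to it. This definition, the function name, and the handling of empty units are assumptions for illustration; the growth and expansion thresholds of the GHSOM itself are not reproduced.

```python
import numpy as np

def mean_quantization_errors(prototypes, data):
    """Per-unit mean distance between each prototype and the data mapped to it.

    prototypes: (k, n) array; data: (N, n) array.
    Returns (mqe, bmu): mqe[i] is 0 for units that win no data; bmu[j] is the winner of data[j].
    """
    d = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)  # (N, k) distances
    bmu = np.argmin(d, axis=1)
    best = d[np.arange(len(data)), bmu]                # distance of each sample to its winner
    counts = np.bincount(bmu, minlength=len(prototypes))
    sums = np.bincount(bmu, weights=best, minlength=len(prototypes))
    mqe = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return mqe, bmu
```

A unit with a large value represents data that are too diverse, which is the situation in which a hierarchical model would refine that unit with a map in the next layer.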

Once the training process is over, visual display of the map must be carried out in order for the 
underlying structure of data to be perceived. A variety of visualization techniques based on the SOM 
have been developed, which will be reviewed in the next section. 
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18.3 SOM Visualizations 


Its visualization capability is a key reason for applying the SOM to data analysis. Once the learning phase 
is over, visual display of the map can be carried out in order for the underlying structure of data to be 
observed. Extracting the visual information provided by the SOM is one of the primary motivations of 
this study. 

The visualization of SOMs is motivated by the fact that a SOM achieves a nonlinear projection of 
the input distribution through a commonly 2D grid. This projection can be visualized in different 
ways by a variety of techniques. Some of them visualize the input vectors directly, whereas others 
take only the prototype vectors (or weight vectors) into account. Based on the object that is visual- 
ized, these techniques can be divided into several categories, which are reviewed in the remainder of 
this section. 


18.3.1 Visualizing Map Topology 


One category of the visualization techniques is to visualize the SOM topology through distance 
matrices. The most widely used method in this category is the unified distance matrix (U-matrix) [22], 
which enables visualization of the topological relations between the neurons in a trained SOM. The idea 
is to show the underlying data structure by graphically displaying the inter-neuron distances between 
neighboring units in the network. The distances of the prototype vector of each map unit to its imme- 
diate neighbors are calculated and form a matrix. The same metric is used to compute the distances 
between map units, as is used during the SOM training to find the BMU. By displaying the values in 
the matrix as a three-dimensional (3D) landscape or a gray-level image, the relative distances between 
adjacent units on the whole map become visible. The U-matrix is calculated in the prototype space and 
displayed using the map space. 

A simplified approach is to calculate a single value for each map unit, such as the maximum or the 
sum of the distances to all immediate neighbors, and use it to control height or color [23]. High values 
in the U-matrix encode dissimilarity between neighboring units. Consequently, they correspond to 
cluster boundaries and are marked by mountains in a 3D landscape or dark shades of gray in a coloring 
scheme. Low values correspond to similarity between neighboring units, resulting in valleys or light 
shades of gray. A demonstration of the U-matrix is presented in Figure 18.4, which is based on a 10 x 10 
rectangular SOM. The Iris data set [24] is used to train the SOM. Two essential clusters can be observed 
from both the gray-level presentation and the 3D landscape presentation. 


FIGURE 18.4 U-matrix presentations of a 10 x 10 rectangular SOM: (a) a gray-level image and (b) a 3D plot. 
The Iris data set is used to train the SOM. 
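
The simplified single-value variant of the U-matrix described above can be sketched for a rectangular grid. The sketch assumes that "immediate neighbors" means the four grid neighbors (up, down, left, right) and that Euclidean distance is the metric; the function name and the `(rows, cols, n)` array layout are hypothetical.

```python
import numpy as np

def u_matrix(weights, reduce=np.max):
    """One value per map unit: reduced distance to its immediate grid neighbors.

    weights: (rows, cols, n) array of prototype vectors on a rectangular grid.
    reduce:  function applied to the list of neighbor distances (np.max, np.sum, ...).
    """
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:   # skip positions outside the map
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            u[r, c] = reduce(dists)
    return u
```

Plotting the returned array as a gray-level image or as a surface gives the two kinds of presentation discussed in this section; high values mark likely cluster boundaries.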


18.3.2 Visualizing Data Density 


Recently, a density-based visualization technique, the P-matrix [25,26], has been introduced, which 
estimates the data density in the input space sampled at the prototype vectors. The P-matrix is defined 
analogously to a U-matrix. Instead of local distances, this technique uses density values in data space 
measured at the position of each prototype vector as height values, called P-heights. The estimate of 
the data density is constructed using Pareto density estimation (PDE) [25], which calculates the density 
as the number of input data points inside a hypersphere (Pareto sphere) within a certain radius (Pareto 
radius) around each prototype vector. In contrast to the U-matrix, neurons with large P-heights are 
situated in dense regions of the data space, while those with small P-heights are in sparse regions. 
Illustrations of different visualizations of a SOM are given in Figure 18.5, taken from Ref. [27]. Figure 
18.5c shows the P-matrix of the Gaussian mixture data set in Figure 18.5a, where darker gray shades 
correspond to larger densities. Compared to the U-matrix presentation shown in Figure 18.5b, the 
P-matrix gives a complementary view of the same data set. 
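
The counting step behind the P-heights can be sketched as follows. The radius is passed in as a plain parameter; how the Pareto radius itself is chosen is described in Ref. [25] and is not reproduced here, so `radius` and the function name are assumptions of this illustration.

```python
import numpy as np

def p_heights(prototypes, data, radius):
    """Number of data points inside a hypersphere of the given radius around each prototype.

    prototypes: (k, n) array; data: (N, n) array. Returns a (k,) integer array.
    """
    d = np.linalg.norm(data[None, :, :] - prototypes[:, None, :], axis=2)  # (k, N) distances
    return np.sum(d <= radius, axis=1)
```

Reshaping the result to the map dimensions gives a matrix that can be displayed in the same way as a U-matrix, with large values indicating dense regions of the data space.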

A combination of the U-matrix and the P-matrix has also been proposed by Ultsch, namely the 
U*-matrix [26]. Commonly viewed as an extension to the U-matrix, it takes both the prototype vectors 
and the data vectors into account. The values of the U-matrix are dampened in highly dense regions, 
unchanged in regions of average density, and emphasized in sparsely populated regions. It is designed 
for use with Emergent SOMs [27], which are SOMs trained with a high number of map units compared 
to the number of data samples. The U*-matrix is advantageous over the U-matrix in data sets with clusters 
that are not clearly separated. The U*-matrix presentation of the Gaussian mixture data set in Figure 
18.5d clearly shows two Gaussian distributions, which the U-matrix fails to reveal. 
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FIGURE 18.5 Different visualizations of the SOM: (a) the original data set of a mixture of two Gaussians, 
(b) U-matrix presentation, (c) P-matrix presentation, and (d) U*-matrix presentation. 
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18.3.3 Visualizing Prototype Vectors 


An alternative way to visualize the SOM is to project the prototype vectors to a 2D output space using a 
generic projection method. Such methods include multidimensional scaling (MDS) [28] and Sammon’s 
mapping [29]. MDS is a traditional technique for transforming a data set from a high-dimensional space 
to a space with lower dimensionality. It creates a mapping to a usually 2D coordinate space, where objects 
can be represented as points. The inter-point distances in the original data space are approximated 
by the inter-point distances of the projected points in the projected space. Accordingly, more similar 
objects have representative points that are spatially nearer to each other. The error function to be mini- 
mized can be written as 


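$$E = \frac{\sum_{i<j} (d_{ij} - d'_{ij})^2}{\sum_{i<j} d_{ij}^2} \tag{18.10}$$


where 
$d_{ij}$ denotes the distance between vectors i and j in the original space 
$d'_{ij}$ denotes the distance between vectors i and j in the projected space 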


A gradient method is commonly used to optimize the above objective function. MDS methods are often 
computationally expensive. 

Closely related to MDS, Sammon’s mapping also aims at minimizing an error measure that 
describes how well the pairwise distances in a data set are preserved [30]. The error function of 
Sammon’s mapping is 


$$E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} - d'_{ij})^2}{d_{ij}} \tag{18.11}$$

Compared to MDS, the local distances in the original space are emphasized in Sammon’s mapping. 
Since the mapping employs a steepest-descent procedure to minimize the error, it requires both first- and 
second-order derivatives of the objective function at each iteration. The computational complexity, as a 
result, is even higher than that of MDS [31]. 
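
The two error measures can be evaluated for any given projection with a few lines of NumPy. The sketch below only computes the errors; it does not perform the minimization. The Sammon error follows its standard definition, while the normalization of the MDS error by the sum of squared original distances is an assumption of this sketch, as are the function names.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances for all pairs i < j, as a flat array."""
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt(np.sum(diff ** 2, axis=2))
    return d[np.triu_indices(len(points), k=1)]

def mds_stress(original, projected):
    """Squared distance errors, normalized by the squared original distances (assumed form)."""
    d, dp = pairwise_distances(original), pairwise_distances(projected)
    return np.sum((d - dp) ** 2) / np.sum(d ** 2)

def sammon_stress(original, projected):
    """Sammon's error; each term is divided by the original distance (no duplicate points)."""
    d, dp = pairwise_distances(original), pairwise_distances(projected)
    return np.sum((d - dp) ** 2 / d) / np.sum(d)
```

Dividing each term by the original distance is what gives small, local distances more weight in Sammon's error than in the MDS error.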

Since the SOM provides a topology-preserving mapping of the input data, the MDS or Sammon’s 
projection of the SOM can be used as a rough approximation of the shape of the input data. Both of these 
nonlinear projection approaches are iterative and computationally intensive. However, the computational 
load can be alleviated to an acceptable level when they are applied to the prototype vectors of a SOM instead of 
the original data set, provided that far fewer map units are used than there are input vectors. 
The MDS projection and Sammon’s mapping of a SOM are given in Figure 18.6. The map 
units are visualized as black dots, which are connected to their neighbors with lines. In this example, 
the Iris data set is used to train a 10 x 10 rectangular SOM. Roughly two clusters can be seen from both 
projections of the SOM. The setosa class is clearly separated from the rest of the data set, while the other two 
linearly inseparable classes, versicolor and virginica, are still joined in the projection space. 

In addition to the high computational cost, another drawback of MDS and Sammon’s mapping is that 
they do not yield a mathematical or algorithmic mapping procedure for previously unseen data points 
[31]. That is, for any new input data point to be accounted for, the whole mapping procedure has to be 
repeated based on all available data. Mao and Jain have proposed a feed-forward neural network to solve 
this problem, which employs a specialized, unsupervised learning rule to learn Sammon’s mapping [32]. 
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FIGURE 18.6 Different ways to visualize the prototype vectors: (a) MDS projection of a SOM and (b) Sammon’s 
mapping of a SOM. Neighboring map units, depicted as black dots, are connected to each other. 


18.3.4 Visualizing Component Planes 


The prototype vectors can also be visualized using the component plane representation. Instead of a single 
plot, this technique provides a “sliced” version of the SOM, which shows the projection of each individual 
dimension of the prototype vectors on a separate plane [33]. The values of each component are taken 
from all prototype vectors and depicted by color coding. Each component plane shows the distribution 
of one prototype vector component. Similar patterns in different component planes indicate correla- 
tions between the corresponding vector components. This technique is hence useful when the correlation 
between different data features is of interest. However, one drawback of component planes is that cluster 
borders cannot be easily perceived. In addition, data with high dimensionality results in a large number of plots. 
The component planes of the 10 × 10 rectangular SOM trained with the Iris data set are presented in 
Figure 18.7. The color scheme of the map units has been set so that the lighter the color is, the smaller 
the component value of the corresponding prototype vector is. It can be seen, for instance, that the two 
components, petal length and petal width, are highly correlated. 

FIGURE 18.7 Component planes representation of a SOM trained with the Iris data set. The color bars beside 
each component plane show the maximum, mean, and minimum values and the corresponding colors. 

FIGURE 18.8 Different presentations of the hit histogram: (a) a gray level image and (b) a 3D plot. The Iris data 
set is used to train the SOM. 
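The component plane idea from Section 18.3.4 can be illustrated with a minimal sketch. It assumes the prototype vectors are stored in row-major order for a rectangular grid; the function names `component_plane` and `pearson` and the toy prototype values are hypothetical and are not taken from the chapter. Correlated components show up as a correlation close to 1 (or -1) between their planes.

```python
import math

def component_plane(prototypes, rows, cols, k):
    """Return component k of every prototype as a rows x cols grid.

    Assumption: prototypes are listed in row-major order of the map grid.
    """
    return [[prototypes[r * cols + c][k] for c in range(cols)] for r in range(rows)]

def pearson(a, b):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Toy 2 x 2 map with three-component prototypes (illustrative values only).
prototypes = [(1.0, 2.0, 5.0), (2.0, 4.0, 3.0), (3.0, 6.0, 4.0), (4.0, 8.0, 1.0)]
plane0 = component_plane(prototypes, 2, 2, 0)
plane1 = component_plane(prototypes, 2, 2, 1)

# Flatten the planes and compare them; component 1 is twice component 0 here.
flat0 = [v for row in plane0 for v in row]
flat1 = [v for row in plane1 for v in row]
corr01 = pearson(flat0, flat1)
```

In practice each plane would be drawn as a color-coded image; the correlation is only a numeric stand-in for the visual comparison described in the text.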


18.3.5 Visualizing Best Matching Units 


Another category of visualization is to display the BMUs of the input data set. Data vectors can be pro- 
jected on the map by locating their BMUs. Because the prototype vectors are ordered on the map grid, 
nearby map units will have similar data projected to them. Projecting multiple input vectors will result 
in a histogram of the BMUs. For each data vector, the BMU is determined and the number of hits for 
that map unit is increased by one. The hit histogram shows the distribution of the data set on the map. 
Map units on cluster borders often have very few data samples, which implies very few hits in the 
histogram. Therefore, low-hit units can be used to indicate cluster borders. The values of a histogram can 
be depicted in different ways. Figure 18.8 illustrates the gray level presentation and the 3D presentation 
of a hit histogram. In Figure 18.8a, the darker the gray shade is, the higher the hit value of that unit is. 
In Figure 18.8b, the height directly corresponds to the value of the histogram. 
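
The hit histogram described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance for the BMU search and a flat list of prototype vectors; the helper names and the toy data are hypothetical.

```python
import math

def best_matching_unit(x, prototypes):
    """Return the index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))

def hit_histogram(data, prototypes):
    """Count, for every map unit, how many data vectors select it as their BMU."""
    hits = [0] * len(prototypes)
    for x in data:
        hits[best_matching_unit(x, prototypes)] += 1
    return hits

# Toy example: a 2 x 2 map flattened to four units in a 2D input space.
prototypes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
data = [(0.1, 0.1), (0.2, 0.0), (0.9, 1.1), (0.1, 0.9)]
hits = hit_histogram(data, prototypes)
```

Units that receive few or no hits, such as the third unit in this toy example, are the candidates for cluster borders mentioned in the text.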

However, hit histograms consider only the BMU for each data sample while real-world data is usually 
well represented by more than one unit. This inevitably causes distortions in the final map. A variation 
of the standard hit histogram, namely the smoothed data histogram (SDH) [34], has been developed 
to account for each data sample’s relation to more than one map unit. The SDH allows a data sample to 
“vote” not only for the BMU but also for the next few good matches based on the ranks of distances 
between the data sample and the corresponding prototype vectors. 
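
A rank-based voting scheme of this kind might look as follows. The sketch assumes a linear weighting in which the closest of the `s` best-matching units receives `s` points, the next `s - 1`, and so on, normalized so that every data sample contributes a total vote of one; this particular weighting and the parameter name `s` are assumptions for illustration, not details given in the text.

```python
import math

def smoothed_data_histogram(data, prototypes, s=3):
    """Let each data vector vote for its s closest units, weighted by rank."""
    n_units = len(prototypes)
    s = min(s, n_units)              # cannot vote for more units than exist
    total_points = s * (s + 1) / 2   # s + (s - 1) + ... + 1
    votes = [0.0] * n_units
    for x in data:
        ranked = sorted(range(n_units), key=lambda j: math.dist(x, prototypes[j]))
        for rank, j in enumerate(ranked[:s]):
            votes[j] += (s - rank) / total_points
    return votes

prototypes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
data = [(0.1, 0.1), (0.2, 0.0), (0.9, 1.1), (0.1, 0.9)]
sdh = smoothed_data_histogram(data, prototypes, s=2)
plain = smoothed_data_histogram(data, prototypes, s=1)
```

With `s = 1` the scheme reduces to the standard hit histogram, which is a convenient sanity check.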


18.3.6 Other Visualizations 


Aside from the above categories, other visualization techniques are also available for SOM. A rather 
different way to project the prototype vectors, the so-called adaptive coordinates [35], was proposed 
with a focus on cluster-boundary detection. This approach mirrors the movements of prototype vectors 
during the SOM training within a 2D “virtual” space, which is used for subsequent visualization of the 
clustering result. The initial positions of the prototype vectors are defined by the network structure, 
which are on top of the junctions of the map grid. The coordinates of the prototype vectors are adapted 
during the training. After convergence of the training process, the prototype vectors can be plotted in 
arbitrary positions in the projected space according to their coordinates. The algorithm offers an extension 
to both the basic training process and the fixed grid representation. 

Another extended SOM model, called the visualization-induced SOM (ViSOM) [36], has been devel- 
oped to directly preserve the distance information along with the topology on the map. The ViSOM 
updates the weight of the winning neuron using the same learning rule as the SOM. For the neighboring 
neurons, the weight adaptation is decomposed into two parts: a lateral movement toward the winner 
and an updating movement from the winner to the input vector. ViSOM places a constraint on the 
lateral contraction force between the neurons and hence regularizes the inter-neuron distances. As a 
result, the inter-neuron distances in the data space are in proportion to those in the map space. A scalable 
parameter $\lambda$ is introduced in the constraint that controls the resolution of the map. If a high 
resolution is desirable, a small $\lambda$ should be used, which will result in a large map. 
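
For orientation, the decomposition described above can be written out explicitly. The following is a sketch of the neighbor update as it is commonly stated in the ViSOM literature [36]; the symbols $\alpha(t)$ (learning rate), $\eta(v,k,t)$ (neighborhood function), $d_{vk}$ (distance between the weights of winner $v$ and neighbor $k$ in the data space), and $\Delta_{vk}$ (distance between the two neurons on the map grid) are not defined in this chapter and are introduced here as assumptions.

```latex
% Standard SOM update of neighbor k, split into the two parts named in the text:
%   x - w_k = (x - w_v) + (w_v - w_k)
w_k(t+1) = w_k(t) + \alpha(t)\,\eta(v,k,t)\,
  \Bigl[ \underbrace{\bigl(x(t) - w_v(t)\bigr)}_{\text{updating movement}}
       + \underbrace{\bigl(w_v(t) - w_k(t)\bigr)}_{\text{lateral movement}}
         \,\frac{d_{vk} - \Delta_{vk}\lambda}{\Delta_{vk}\lambda} \Bigr]
```

Under this form the lateral term vanishes when $d_{vk} = \Delta_{vk}\lambda$, which is one way to see why the inter-neuron distances in the data space end up proportional to those on the map, with $\lambda$ acting as the scale.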


18.4 SOM-Based Projection 


Several challenges remain when using the SOM for visualizing document databases. First, the shape 
of the grid and the number of nodes have to be predetermined. This requires prior knowledge of the 
input data characteristics, which is usually unavailable before analysis. Second, the underlying hier- 
archical relations can hardly be detected by a single map. Such relations are commonly observed in 
document collections and, thus, their proper identification is highly desirable. A further limitation, 
which occurs when using the SOM projection, is that the map resolution depends solely on the size 
of the map. A high-resolution document map, which is desirable in most cases, requires a 
considerably large number of neurons. To achieve a better visualization, a high-resolution SOM may 
even call for more neurons than there are input vectors [37]. As a result, the SOM becomes 
impractically large when dealing with large data sets. Moreover, because the computational complexity 
grows quadratically with the number of neurons [38], training such huge maps may be 
exceedingly time-consuming. 

To resolve the above limitations, a SOM-based visualization approach has been proposed [39]. Figure 
18.9 shows the schematic diagram of the proposed approach. First, a similarity matrix is derived from 
the collection of documents of interest. The similarity matrix is then used to train a GHSOM, which 
clusters document items in a hierarchical manner and at the same time allows for adaptation of the 
network architecture during training. Following the training of the GHSOM, a novel projection 
technique, the ranked centroid projection (RCP), is used to project the input vectors to a hierarchy of 2D 
output maps. Using the proposed approach, a hierarchy of multiple data projections can be achieved with 
comparatively low computational cost. 
FIGURE 18.9 The schematic diagram of the proposed SOM-based approach. 


18.4.1 User Architecture 


The typical goal of document clustering and visualization is to discover subsets of large document 
collections that correspond to individual topics. Additionally, it can be applied hierarchically, yielding 
more refined groups within clusters. This leads to a large-to-small-scale presentation of the conceptual 
structure of the document collection, in which large-scale clusters correspond to more general topics 
and smaller scale ones correspond to more specific topics within the general topics. Cluster hierarchies 
thus serve as topic hierarchies [40]. To detect this hierarchical structure, the GHSOM [21] is employed in 
the proposed approach. The GHSOM combines the advantages of two principal extensions of the SOM, 
dynamic growth and hierarchical structure. As depicted in Figure 18.10, the GHSOM evolves to a multi- 
layered architecture composed of independent growing SOMs. At layer 0, a single-unit SOM serves as 
a representation of the complete data set. Only one map is used at the first layer of the hierarchy, which 
initially consists of a grid of 2 × 2 units. For every unit in this map, a separate SOM can be added to the 
second layer. The model grows in two dimensions: in width (by increasing the size of each SOM) and in 
depth (by increasing the levels of the hierarchy). For growing in width, each SOM attempts to modify 
its layout and increase its total number of units systematically so that each unit is not representing too 
many input patterns. The basic steps are summarized in Table 18.1. 

As for growing in depth, the general idea is to form a new map in the subsequent layer for the units 
representing a set of input vectors that are too diverse. The basic steps for the growth in depth are 
summarized in Table 18.2. 

The growing process of the GHSOM is guided by two parameters, $\tau_1$ and $\tau_2$. $\tau_2$ specifies the desired 
quality of input data representation at the end of the training process, while $\tau_1$ specifies the desired level 
of detail that is to be shown in a particular SOM. The smaller $\tau_1$ is, the larger the emerging maps will be. 
Conversely, the larger $\tau_1$ is, the deeper the hierarchy will be. 


FIGURE 18.10 Graphic representation of a trained GHSOM. 


TABLE 18.1 Steps of the Growth in Width 

1. Initialize the weight of each unit with random values. Reset the error variable $E_i$ for every unit $i$. 
2. Apply the standard SOM training algorithm. 
3. For every input vector, measure the quantization error (qe) of the corresponding winner in terms of the 
   deviation between its weight vector and the input vector. Update the winner's error variable by adding the 
   qe to $E_i$. 
4. After a fixed number $\lambda$ of training iterations, identify the error unit $q$ with the highest $E_i$. 
5. Insert a row or a column between the error unit $q$ and its most dissimilar neighboring unit in terms of input 
   space. 
6. Repeat steps 2-5 until the whole map’s mean quantization error ($MQE_m$) reaches a given threshold so that 
   $MQE_m < \tau_1 qe_u$ is satisfied, where $qe_u$ is the quantization error of the corresponding unit $u$ in the 
   preceding layer of the hierarchy and $\tau_1$ is a fixed percentage. 


TABLE 18.2 Steps of the Growth in Depth 

1. When the training of a map is finished, every unit is examined, and those units fulfilling 
   the criterion $qe_i > \tau_2 qe_0$ are subject to a hierarchical expansion, where $qe_0$ is the 
   quantization error of the single unit in layer 0. 
2. Train the newly added SOM with the input vectors mapped to the unit that has just been 
   expanded. 
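
The two growth criteria in Tables 18.1 and 18.2 can be sketched as simple predicates. This is a minimal illustration, not the GHSOM Toolbox implementation: the function names are hypothetical, the map's mean quantization error is assumed to be the plain average over all units of the map, and the numeric values are made up for the example.

```python
def error_unit(accumulated_errors):
    """Index of the unit with the highest accumulated error E_i (Table 18.1, step 4)."""
    return max(range(len(accumulated_errors)), key=lambda i: accumulated_errors[i])

def width_growth_done(unit_qes, qe_parent, tau1):
    """True once MQE_m < tau1 * qe_u, so the map stops growing (Table 18.1, step 6).

    Assumption: MQE_m is the plain mean of the units' quantization errors.
    """
    mqe = sum(unit_qes) / len(unit_qes)
    return mqe < tau1 * qe_parent

def units_to_expand(unit_qes, qe0, tau2):
    """Units with qe_i > tau2 * qe_0 receive a child map (Table 18.2, step 1)."""
    return [i for i, qe in enumerate(unit_qes) if qe > tau2 * qe0]

# Hypothetical first-layer map with four units; qe0 is the layer-0 error.
unit_qes = [4.0, 1.0, 0.5, 2.5]
qe0 = 10.0

worst = error_unit(unit_qes)
stop_loose = width_growth_done(unit_qes, qe0, tau1=0.3)   # MQE 2.0 < 3.0
stop_strict = width_growth_done(unit_qes, qe0, tau1=0.1)  # MQE 2.0 >= 1.0
expand_few = units_to_expand(unit_qes, qe0, tau2=0.3)
expand_all = units_to_expand(unit_qes, qe0, tau2=0.02)
```

The example shows the direction of both thresholds: a smaller `tau1` keeps the map growing in width for longer, and a smaller `tau2` marks more units for expansion into the next layer.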


18.4.2 Ranked Centroid Projection 


The objective of the proposed RCP is to map input vectors onto the output space based on their similarities 
to the prototypes. Although this method is based on the standard SOM architecture [41,42], its applica- 
tion to a GHSOM is straightforward. Once the training process of the GHSOM is complete, a hierarchy 
of multiple layers consisting of several independent SOMs will be formed. The RCP can be applied to each 
individual map in the GHSOM afterward. 

For each individual SOM in a GHSOM network, a set of prototype vectors is tuned and becomes 
topologically ordered during the training phase. The prototypes can be interpreted as cluster centers, 
while the coordinates of each map unit i indicate the position of the corresponding cluster center within 
the grid of the map. After convergence of the training process, for any input vector $x_i$, its similarity 
to each prototype vector $w_j$ can be calculated. A similarity measure can be defined as the inverse of the 
Euclidean distance between the respective vectors: 


$$s_{ij} = d_{ij}^{-1} = \|x_i - w_j\|^{-1} \qquad (18.12)$$


where 
$s_{ij}$ is the similarity value 
$d_{ij}$ is the distance between $x_i$ and $w_j$ 


The map unit with the smallest distance to an input vector $x_i$ is the BMU, as described in Equation 18.1. 
The BMU has the greatest similarity with $x_i$ and corresponds to the cluster to which $x_i$ is most closely 
related. Hence, $x_i$ should be projected to a position closer to the BMU than to other units. In a 
winner-takes-all case, in which only the BMU is considered, the data sample is mapped directly to its 
BMU. The coordinates of $x_i$, denoted $Cx_i$, can be represented as 


$$Cx_i = Cw_1 \qquad (18.13)$$


where $Cw_1$ represents the coordinates of the BMU. The resulting mapping is illustrated on the right side 
of Figure 18.11. Projecting multiple data samples will result in a hit histogram. 


FIGURE 18.11 Illustration of mapping an input vector to its BMU. 

FIGURE 18.12 Illustration of mapping an input vector when two units are considered. 


However, in most cases, several units match the input almost as well as the BMU. As 
a result, pointing out only the BMU does not provide sufficient information about cluster membership, which 
is the problem with hit histograms. Intuitively, the data item should be projected to somewhere between 
the map units with a good match. By analogy, each map unit exerts an attractive force on the data item 
proportional to its response to that data item. The greater the force is, the closer the data item is attracted to 
the map unit. The data item will end up in a position where these forces reach an equilibrium state. 

In the projection process described in Figure 18.11, if $W_1$, the BMU, and $W_2$, the second winning unit, 
are taken into account, intuitively the data sample should be projected to a position so that it is between 
these two units while closer to the BMU. This is illustrated in Figure 18.12, where $d_1$ and $d_2$ are the 
Euclidean distances between the data sample and the two winners $W_1$ (BMU) and $W_2$. The responses of $W_1$ 
and $W_2$ to the data sample are, therefore, inversely proportional to $d_1$ and $d_2$. The projection of the data 
sample can be decided by the following weighted sum: 


$$Cx_i = \frac{d_1^{-1}}{d_1^{-1} + d_2^{-1}}\, Cw_1 + \frac{d_2^{-1}}{d_1^{-1} + d_2^{-1}}\, Cw_2 \qquad (18.14)$$


where $Cw_2$ is the coordinates of the second winner $W_2$. 
Similarly, in the case of three winners, the coordinates of $x_i$ are calculated as 


$$Cx_i = \frac{d_1^{-1}\, Cw_1 + d_2^{-1}\, Cw_2 + d_3^{-1}\, Cw_3}{d_1^{-1} + d_2^{-1} + d_3^{-1}} \qquad (18.15)$$


where $Cw_3$ are the coordinates of the third winner. This can be extended to include all N units in the 
map grid in the computation of the projections. In general, the coordinates of $x_i$ can be assigned by 
the following function: 


$$Cx_i = \begin{cases} \dfrac{\sum_{j=1}^{N} d_{ij}^{-1}\, Cw_j}{\sum_{j=1}^{N} d_{ij}^{-1}}, & \text{if } d_{ij} \neq 0 \text{ for all } j \\[2ex] Cw_j, & \text{if } d_{ij} = 0 \end{cases} \qquad (18.16)$$


where 
$d_{ij}$ is the distance between $x_i$ and prototype vector $w_j$ 
The inverse distance $d_{ij}^{-1}$ is used as a measure of the similarity 
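
The inverse-distance weighted average of Equation 18.16 can be sketched as follows. This is a minimal illustration under stated assumptions: distances are Euclidean, prototypes and grid coordinates are given as parallel lists, and the function name `centroid_projection` is hypothetical.

```python
import math

def centroid_projection(x, prototypes, unit_coords):
    """Project x onto the map as the inverse-distance weighted mean of unit coordinates."""
    dists = [math.dist(x, w) for w in prototypes]
    # If x coincides with a prototype, map it directly onto that unit.
    for d, coord in zip(dists, unit_coords):
        if d == 0.0:
            return tuple(float(v) for v in coord)
    weights = [1.0 / d for d in dists]
    total = sum(weights)
    dim = len(unit_coords[0])
    return tuple(
        sum(wt * coord[k] for wt, coord in zip(weights, unit_coords)) / total
        for k in range(dim)
    )

# Two units in a 1D input space, placed at grid positions (0, 0) and (1, 0).
prototypes = [(0.0,), (3.0,)]
unit_coords = [(0, 0), (1, 0)]

pos = centroid_projection((1.0,), prototypes, unit_coords)    # d1 = 1, d2 = 2
exact = centroid_projection((3.0,), prototypes, unit_coords)  # sits on unit 2
```

For the first sample the weights are 1 and 1/2, so the projection lands one third of the way from the first unit to the second, closer to the BMU as Equation 18.14 requires.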


As shown in Equation 18.16, the projected position of the data sample $x_i$ is computed as a weighted 
average of the positions of nearby map units. The weighting is based on the distance $d_{ij}$ between $x_i$ 
and the map unit. As $d_{ij}$ is inversely proportional to the similarity between the data sample $x_i$ and 
prototype vector $w_j$, the weighting factors indicate the normalized responses of $x_i$ to the various 
prototypes. Since the map units are arranged in a rectangular grid, the set of weights may be characterized 
as a 2D histogram plotted across the map units as illustrated in Figure 18.13. The SOM projection 
procedure continues with finding the centroid of this spatial histogram, where the data sample is 
then mapped. 

FIGURE 18.13 Illustration of mapping an input vector by finding the centroid of the spatial responses. 

The basic centroid projection method [41,42] finds the projections in the output space by taking into 
account all N map units and calculating the weighted average, as shown in Equation 18.16. To enhance 
the performance of the projection method, the basic weighting function in Equation 18.16 is subject to 
modifications and correction terms. Instead of mapping the data sample directly onto the centroid of the 
spatial responses of all map units, a ranking scheme is applied to the weighting function. First, a constant 
R is set to select only a number of prototypes that are near the input vector in the input space. Only the 
positions of the associated R units will affect the calculation of the projection. R is in the range of one to 
the total number of neurons in the SOM. A membership degree of a data sample to a specific cluster is then 
defined based on the rank of closeness between the data vector and the unit associated with that cluster, 
which is given by 


$$m_j = \begin{cases} \dfrac{R}{S}, & \text{for the closest unit} \\[1.5ex] \dfrac{R-1}{S}, & \text{for the 2nd closest unit} \\ \;\vdots & \\ \dfrac{1}{S}, & \text{for the } R\text{th closest unit} \\[1.5ex] 0, & \text{for all other units} \end{cases} \qquad (18.17)$$


where $S = \sum_{i=0}^{R-1} (R - i)$ ensures a normalized membership. 


For a data sample $x_i$, the new weighting function is defined by applying the membership degree $m_j$ 
to Equation 18.16: 


$$Cx_i = \frac{\sum_{j=1}^{N} m_j\, d_{ij}^{-1}\, Cw_j}{\sum_{j=1}^{N} m_j\, d_{ij}^{-1}}, \quad \text{if } d_{ij} \neq 0 \text{ for all } j \qquad (18.18)$$


With the new weighting function, the spatial histogram shown in Figure 18.13 is first ranked by the 
corresponding membership degrees, and its centroid is then located. The projection method proceeds 
by mapping the data sample to this position. 
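
The ranked weighting of Equations 18.17 and 18.18 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, distances are assumed Euclidean, and a sample that coincides with a prototype is mapped straight to that unit, which carries over the special case of Equation 18.16 as an assumption.

```python
import math

def rank_memberships(dists, R):
    """Membership degrees: R/S for the closest unit, ..., 1/S for the R-th, else 0."""
    S = R * (R + 1) / 2  # equals the sum over i = 0..R-1 of (R - i)
    order = sorted(range(len(dists)), key=lambda j: dists[j])
    m = [0.0] * len(dists)
    for rank, j in enumerate(order[:R]):
        m[j] = (R - rank) / S
    return m

def ranked_centroid_projection(x, prototypes, unit_coords, R):
    """Project x using membership-weighted inverse distances of the R closest units."""
    dists = [math.dist(x, w) for w in prototypes]
    for d, coord in zip(dists, unit_coords):
        if d == 0.0:  # assumption: exact match maps directly to that unit
            return tuple(float(v) for v in coord)
    m = rank_memberships(dists, R)
    weights = [mj / d for mj, d in zip(m, dists)]
    total = sum(weights)
    dim = len(unit_coords[0])
    return tuple(
        sum(wt * coord[k] for wt, coord in zip(weights, unit_coords)) / total
        for k in range(dim)
    )

# 2 x 2 map whose prototypes happen to coincide with the grid coordinates.
prototypes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
unit_coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
x = (0.2, 0.0)

pos_r1 = ranked_centroid_projection(x, prototypes, unit_coords, R=1)
pos_r2 = ranked_centroid_projection(x, prototypes, unit_coords, R=2)
```

With R = 1 the sample falls on its BMU, reproducing the hit histogram case. With R = 2 only the two closest units contribute, so the sample moves off the grid point toward the second winner while staying closer to the BMU.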

The ranking scheme has a beneficial effect on the performance of the projection method. It introduces 
a membership degree factor into the weighting function in addition to the distance factor, which enables 
the proposed projection technique not only to reveal the inter-cluster relation of the input vectors but 
also to visualize information on cluster memberships. A positive side benefit of the ranking scheme 
is a considerable saving in computation as the result of selecting only the R closest units. This saving 
becomes more significant when the map size is large. 

As discussed earlier, for each individual map in a trained GHSOM, the RCP is applied and the data 
samples are projected onto the map space. The data samples used for training in each layer are a fraction 
of the input data in the preceding layer. The projection process results in multiple layers of rather 
small maps showing different degrees of detail. Because the RCP algorithm allows the data points to be 
projected to any location across the SOM network, it can handle a large data set with a rather small map 
size and provide a high-resolution map at the same time. Therefore, the presented procedure of mapping 
input vectors to the output grid using the RCP alleviates computational complexity considerably, making 
it possible to process a large data set. 


18.4.3 Selecting the Ranking Parameter R 


Figures 18.11 and 18.12 show that different R values result in different mapping positions 
of the input vector. The effect of the ranking parameter R is further illustrated in the following 
example. 

A 3D data set is shown in Figure 18.14, which consists of 300 data points randomly drawn from three 
Gaussian sources. The mean vectors of the three Gaussian sources are $[0,0,0]^T$, $[3,3,3]^T$, and $[9,0,0]^T$, 
respectively, while the variances are all 1. A SOM of 2 × 2 units is used to project the data points onto 
a 2D space. After training, the prototype vectors of the SOM are shown as plus signs in Figure 18.14, 
which span the input space with three of the map units representing the three cluster centers, respec- 
tively. The input vectors are then projected using the RCP method. The projection results produced with 
different R values are presented in Figure 18.15. 
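
A data set of the kind used in this example could be generated as sketched below. The text gives only the total of 300 points, the three mean vectors, and unit variance; the equal split of 100 points per source, the independent (isotropic) components, the fixed seed, and the function name are assumptions made for illustration.

```python
import random

def sample_gaussian_clusters(means, n_per_cluster, std=1.0, seed=0):
    """Draw n_per_cluster points around each mean with independent Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    data, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(n_per_cluster):
            data.append(tuple(rng.gauss(m, std) for m in mean))
            labels.append(label)
    return data, labels

means = [(0.0, 0.0, 0.0), (3.0, 3.0, 3.0), (9.0, 0.0, 0.0)]
data, labels = sample_gaussian_clusters(means, n_per_cluster=100)
```

The labels are kept only so that cluster-based measures can later be evaluated on the projected points; the SOM training itself does not use them.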

The effect of R can be seen in Figure 18.15. For the case of R = 1, where only the BMU is considered 
in the projection, the map is actually a hit histogram (a small random noise is added to the 
coordinates of each data point to show the volume of data points projected onto each map unit). Because 
it can only project input vectors to the map units on a rigid grid, this map does not provide much 
information about the global shape of the data. For all possible R values, three major clusters can 
be observed from the map. As R gets larger, the structure and shape of the data become more 
prominent. It is also noticeable that the cluster borders become obscure as R increases. The 
performance of the RCP depends heavily on the value of the ranking parameter R. It would be beneficial 
to determine the optimal R value automatically for each map based upon certain performance 
metrics. 

FIGURE 18.14 Data set I: samples in a 3D data space marked as small circles and prototype vectors as plus signs. 

FIGURE 18.15 Projection results with different R: (a) R = 1, (b) R = 2, (c) R = 3, and (d) R = 4. 

If meaningful conclusions are to be drawn from the projection result, as much of the geometric 
relationships among the data patterns in the original space as possible should be preserved through the 
projection. At the same time, it is desirable for the projection result to provide as much information 
about the shape and cluster structure of the data as possible. Members of each cluster should be close to 
one another while the clusters should be widely spaced from one another. Thus, a combination of two 
quantitative measures, Sammon’s stress [29] and the Davies-Bouldin (DB) index [43], is used in this 
work to determine the optimal R. Sammon’s stress measures the distortion between the pairwise 
distances in the original and the projected spaces. To achieve good distance preservation, Sammon’s 
stress should be minimized. The DB index rewards large inter-cluster distances and small 
intra-cluster distances at the same time. It is commonly used as a clustering validity index, 
with low values indicating good clustering results. 
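
Both measures can be computed directly from point coordinates, as in the sketch below. It uses the commonly stated definitions: Sammon's stress as the normalized sum of squared differences between original and projected pairwise distances, each divided by the original distance, and the DB index as the average over clusters of the worst ratio of summed within-cluster scatter to centroid separation. The function names are hypothetical, and skipping coincident points in the stress is an assumption made to avoid division by zero.

```python
import math
from itertools import combinations

def sammon_stress(original, projected):
    """Sammon's stress between pairwise distances in the original and projected spaces."""
    total, err = 0.0, 0.0
    for i, j in combinations(range(len(original)), 2):
        d_star = math.dist(original[i], original[j])
        if d_star == 0.0:
            continue  # assumption: coincident points are skipped
        d = math.dist(projected[i], projected[j])
        total += d_star
        err += (d_star - d) ** 2 / d_star
    return err / total

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def davies_bouldin(clusters):
    """DB index for a list of at least two clusters, each a non-empty list of points."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

# Two points 5 apart in 2D, projected to points 4 apart in 1D.
stress = sammon_stress([(0.0, 0.0), (3.0, 4.0)], [(0.0,), (4.0,)])

# Two compact, well-separated clusters: scatter 1 each, centroids 10 apart.
db = davies_bouldin([[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]])
```

In the setting of this section, the stress would be evaluated between the input vectors and their projected map positions, and the DB index on the projected positions grouped by cluster, once for every candidate R.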

For the projection results in Figure 18.15, both Sammon’s stress and the DB index are calculated for 
each R value, which are shown in Figure 18.16a and b, respectively. It can be seen that the two 
quantitative measures have contradicting trends. As R grows larger, Sammon’s stress increases while the DB 
index decreases. It is, hence, impossible to optimize both of the objectives at the same time. 

FIGURE 18.16 (a) Sammon’s stress and (b) DB index for R = 1, 2, 3, 4. 

We must identify the best compromise, which serves as the optimal R in this context. The task of selecting the 
optimal R now boils down to a bi-objective optimization problem. A typical way to solve the problem is 
to use the weighted sum method, which is stated in Ref. [44]: 


$$J = \alpha\, \frac{J_1(x)}{J_{1,0}(x)} + (1 - \alpha)\, \frac{J_2(x)}{J_{2,0}(x)} \qquad (18.19)$$


where 
$J_1$ and $J_2$ are two objective functions to be mutually minimized 
$J_{1,0}$ and $J_{2,0}$ are normalization factors for $J_1$ and $J_2$, respectively 
$\alpha$ is the weighting factor revealing the relative importance between $J_1$ and $J_2$ 


In the context of this work, $J_1$ and $J_2$ correspond to Sammon’s stress and the DB index, respectively. 
Assuming these two objective functions have equal importance, $\alpha$ is set to 0.5. By taking the weighted 
sum, the two objective functions are combined into a single cost function, which is shown in Figure 18.17. 
The optimization problem is, therefore, reduced to minimizing a scalar function. As shown in Figure 18.17, 
the objective function reaches its minimum when R equals 2. Consequently, for the 3D example, R = 2 leads to 
the best compromise between good distance preservation and good clustering quality. 

FIGURE 18.17 The optimal R is found at the minimal point of the single-cost function. 
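
The selection of R by a weighted sum of two normalized objectives can be sketched as follows. The stress and DB values below are made-up numbers chosen only to mimic the opposing trends described in the text; they are not read from Figures 18.16 or 18.17. Using each objective's maximum over the candidate R values as its normalization factor is likewise an assumption, since the chapter does not state how the factors are chosen.

```python
def weighted_sum_cost(j1, j2, j1_norm, j2_norm, alpha=0.5):
    """Scalar cost: alpha * J1 / J1_0 + (1 - alpha) * J2 / J2_0."""
    return alpha * j1 / j1_norm + (1.0 - alpha) * j2 / j2_norm

def select_optimal_r(stress_by_r, db_by_r, alpha=0.5):
    """Return the R with the smallest combined cost, plus all costs for inspection."""
    j1_norm = max(stress_by_r.values())  # assumed normalization factor for J1
    j2_norm = max(db_by_r.values())      # assumed normalization factor for J2
    costs = {
        r: weighted_sum_cost(stress_by_r[r], db_by_r[r], j1_norm, j2_norm, alpha)
        for r in stress_by_r
    }
    best_r = min(costs, key=costs.get)
    return best_r, costs

# Illustrative values only: stress rises with R while the DB index falls.
stress_by_r = {1: 0.10, 2: 0.12, 3: 0.30, 4: 0.40}
db_by_r = {1: 1.00, 2: 0.60, 3: 0.55, 4: 0.50}

best_r, costs = select_optimal_r(stress_by_r, db_by_r, alpha=0.5)
```

Changing `alpha` shifts the compromise: values near 1 favor distance preservation and hence small R, while values near 0 favor cluster separation and hence large R.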


18.5 Case Studies 


To demonstrate the applicability of the proposed SOM visualization method, simulation results on two 
realistic document sets are presented in this section. The GHSOM Toolbox [45] was used in conducting 
this work. 


18.5.1 Encoding of Documents Using Citation Patterns 


To visualize a document collection, the inter-document relations are first encoded into a similarity 
matrix. Citation-based similarity matrices are used extensively when working with patents and jour- 
nal articles from sources that provide citation data such as the SCI. The SCI provides access to current 
and retrospective citation information for scientific literature published in the physical, biological, and 
medical fields. It presents a great opportunity for citation-based document analysis. 

Each dimension of the similarity matrix corresponds to one document in the set, and the value of 
each element is equal to the relative strength of the citation relationship between the corresponding 
document pair. These citation patterns provide explicit linkages among publications having particular 
points in common, and hence are considered as reliable indicators of intellectual connections among 
documents. Similarities calculated from citations generally produce meaningful document maps whose 
patterns expose clusters of documents and relations among those clusters [46]. 

In our approach, the similarity matrix is constructed based on one type of inter-document citations, 
in particular the bibliographic coupling [47]. Bibliographic coupling between a pair of papers is defined 
as the number of references both papers cite. In bibliometric studies [48], bibliographic coupling is used 
to cluster papers into research fronts, that is, groups of papers that cover the same topic. Using biblio- 
graphic coupling, inter-document similarities are calculated as 


$$s_{ij} = \frac{bc_{ij}}{\sqrt{N_i N_j}} \qquad (18.20)$$


where 
$bc_{ij}$ is the number of documents cited by both document $i$ and document $j$ 
$N_i$ and $N_j$ are the total numbers of document citations for documents $i$ and $j$, respectively 


The similarity matrix is, therefore, a symmetric matrix that contains the bibliographic coupling counts 
between all pairs of documents in the database. The rows, or columns, of the similarity matrix are 
vectors corresponding to individual documents. Documents are subsequently clustered using the 
similarity matrix. 
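
The construction of the similarity matrix from Equation 18.20 can be sketched as follows. The sketch assumes each document is represented by the set of references it cites, takes $N_i$ to be the size of that set, and assigns a similarity of zero when a document has no references; the function name and the toy reference lists are hypothetical.

```python
import math

def bibliographic_coupling_matrix(reference_sets):
    """Similarity s_ij = |refs_i & refs_j| / sqrt(N_i * N_j) for all document pairs."""
    n = len(reference_sets)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            n_i, n_j = len(reference_sets[i]), len(reference_sets[j])
            if n_i == 0 or n_j == 0:
                continue  # assumption: documents without references get similarity 0
            shared = len(reference_sets[i] & reference_sets[j])
            sim[i][j] = shared / math.sqrt(n_i * n_j)
    return sim

# Three toy documents identified by the references they cite.
reference_sets = [{"a", "b", "c", "d"}, {"a", "b"}, {"x"}]
sim = bibliographic_coupling_matrix(reference_sets)
```

Each row of `sim` is the vector that would describe one document when training the map. The matrix is symmetric, and a document with at least one reference has similarity 1 with itself.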


18.5.2 Collection of Journal Papers on Self-Organizing Maps 


The first data set is constructed based on a collection of journal papers from the ISI Science Citation 
Index on the subject of SOMs. Using the term “self-organizing maps” in the general search function of 
ISI Web of Science, a set of 1349 documents was collected corresponding to journal articles published 
from 1990 to early 2005. In this simulation, document citation patterns are used to describe the inter- 
document relationships between pairs of documents, based on which a similarity matrix is built to store 
pair-wise similarity values among papers. After removing the poorly related papers, 638 documents 
remain in the data set. Following the document encoding process, we end up with a 638 × 638 similarity 
matrix, each row (or column) vector describing the citation pattern of a document in a 638-dimensional 
space. These vectors are then used to train a GHSOM network. 

FIGURE 18.19 The projection result of the journal papers on the SOM, where documents are marked as circles. 

In this simulation, a three-layer GHSOM was generated by setting the thresholds $\tau_1 = 0.8$ and 
$\tau_2 = 0.008$, which is illustrated in Figure 18.18. The first-layer map consists of 3 × 4 units. The projection of 
all the documents onto the first-layer map is shown in Figure 18.19, which is obtained by using a 
ranking parameter R = 5. Five major topic groups can be observed from the map, four located around the 
four corners of the map and one in the lower center. The topics associated with each group are labeled. 
The labels are derived after examining manually the paper titles from each cluster for common subjects. 

To enhance the visualization, the size of document markers can be made proportional to the number 
of times a document has been cited, as shown in Figure 18.20. Important papers, usually distinguished 
by large citation counts, are thus made to stand out on the document map. In this collection of SOM 
documents, three papers are extraordinarily heavily cited, as marked in Figure 18.20. The two large 
circles in the top part of the map, “Toronen et al. 1999” and “Tamayo et al. 1999,” which appear to be 
closely related, correspond to two important works on applying the SOM to clustering gene expression 
data. Tamayo and colleagues used the SOM to cluster genes into various patterned time courses and 
also devised a gene expression clustering software package, GeneCluster. Another implementation of the SOM was 
developed by Toronen et al. in 1999 for clustering yeast genes. The third heavily cited paper is “The 
Self-Organizing Map” by Kohonen, published in 1990, which is cited by a large portion of the documents in 
this data set. Note that the foundational papers Kohonen published in the 1980s, such as Ref. [1], are not 
available from the ISI Web service. 

FIGURE 18.20 An enhanced visualization of the SOM papers, where the size of the document marker is 
proportional to the number of times a document has been cited. 

Based on the initial separation of the most dominant topical clusters, further maps were 
automatically trained to represent the various topics in more detail. Nine individual maps are developed in the 
second layer, each representing the documents of the respective higher layer unit in more detail. One 
example is illustrated in Figure 18.21. This figure shows the projection result of a submap, expanded 
from the second node from the left in the bottom row of the first-layer map. The submap consists of 3 × 3 units, 
representing a cluster of papers covering the theoretical aspects of the SOM. Kohonen’s seminal 
paper is located in this submap. Some of the nodes on these second-layer maps are further expanded as 
distinct SOMs in the third layer. Due to the incompleteness of this document collection, limited 
information about this subject domain is revealed from the map displays. 

FIGURE 18.21 The projection result of a submap. 


18.5.3 Collection of Papers on Anthrax Research 


The second data set is a collection of journal papers on anthrax research, which is also obtained from ISI 
Web of Science. Anthrax research makes an excellent example for testing the performance of document 
clustering and visualization. The subject is well covered by the Science Citation Index. A great deal of the 
research has been performed in the past 20 years. A review paper [49] is available where the names of 
key papers in this field are identified and discussed. The anthrax paper set we collected for this simula- 
tion contains 987 documents corresponding to journal papers published from 1981 to the end of 2001. 

A 987 × 987 similarity matrix was formed to train a GHSOM. A three-layer GHSOM was generated 
by setting the thresholds $\tau_1 = 0.78$ and $\tau_2 = 0.004$, as illustrated in Figure 18.22. The first-layer map 
consists of 3 × 4 units. The projection result of all documents onto this layer is shown in Figure 18.23, which 
was produced by setting the ranking parameter R to 3. 

FIGURE 18.22 The resulting three-layer GHSOM for the collection of anthrax papers. 

FIGURE 18.23 First-layer projection of the anthrax journal papers. 

In Figure 18.23, several clusters can be seen on the map with their topic labels. Starting from the 
upper left corner of the map and going clockwise, we can see the topics of the papers change with 
location on the map. The cluster of documents in the upper left corner is focused on how anthrax 
moves, interacts with, and enters host cells. Note that several smaller groups are visible inside this cluster, 
which implies that expansion of this cluster in a further layer would reveal several sub-topics. To the right, the 
documents in the upper center of the map are found to deal with anthrax genes. The documents located 
in the upper right corner cover biological effects of anthrax, while the cluster right below covers the effect 
of anthrax on the immune system. In the lower right corner of the map, another group of documents 
exists, which deals with the comparison of anthrax and other Bacillus strains. A tight cluster is formed in 
the lower left corner of the map, which discusses the use of anthrax as a bio-weapon. As a whole, several 
obvious groups of documents are formed on the map, which relate to different research focuses in the 
context of anthrax research. Fundamental research topics are located in the upper portion of the map 
and are somewhat in the vicinity of one another. There are no obvious borders between these groups as the 
topics are closely interrelated. In contrast, other relevant topics on anthrax are mapped to the lower 
portion, which are rather far from the fundamental research topics and from one another. It can be seen 
that the geometric distance indicates the degree of relevance among documents. There are also many 
documents sitting between clusters, which are the result of the heavily overlapped research coverage. 
More information about the document set can be gained by identifying the seminal papers in it. 
For this purpose, the document marker sizes are made proportional to the number of times they have 
been cited. The result is shown in Figure 18.24. Several seminal papers can be identified from Figure 
18.24, five of which will be identified in the following as examples. The earliest seminal paper is the 
Gladstone paper published in 1946, in which Gladstone reported on the discovery of protective anti- 
gen. In the 1950s, Smith and Keppie showed in their paper that anthrax kills through a toxin. These 
papers are landmark papers forming the foundation for anthrax research and they fall into the cluster of 
anthrax effect on immunity. Later, another influential paper was published in 1962 when Beall showed 
FIGURE 18.24 First-layer projection of the anthrax papers set.
FIGURE 18.25 Second-layer document projection.


that anthrax has a three-part toxin. Leppla announced the edema factor in his 1982 paper. These papers
mainly deal with the effect of anthrax on host cells. Another heavily cited paper, on macrophages,
was published by Friedlander in 1986 and became the key paper in this area.

All of the neurons of the first-layer SOM are expanded to the second layer to represent the respective
topics in more detail. The resulting second-layer map of the upper-left node, which contains a total of
192 documents, is shown in Figure 18.25. In the second layer, the documents are further clustered into three groups:
anthrax effect on macrophages, anthrax delivery, and anthrax interaction. This result is consistent with the
first-layer representation. One unit on this second-layer map is further expanded in the third layer.


18.6 Conclusion 


This chapter introduces an approach for clustering and visualizing high-dimensional data, especially 
textual data. The devised approach, which is an extension of the SOM, acts as an analysis tool as well as a 
direct interface to the data, making it a very useful tool for processing textual data. It clusters documents 
and presents them on a 2D display space. Documents with a similar concept are grouped into the same 
cluster, and clusters with similar concepts are located nearby on a map. 

In the training phase, the proposed approach employs a GHSOM architecture, which grows both 
in depth according to the data distribution, allowing a hierarchical decomposition and navigation 
in portions of the data, and in width, implying that the size of each individual map adapts itself to 
the requirements of the input space. After convergence of the training process, a novel approach, 
the RCP, is used to project the input vectors to the hierarchy of 2D output maps of the GHSOM. The 
performance of the presented approach has been illustrated using two real-world document collec- 
tions. The two document collections are scientific journal articles on two different subjects obtained 
from the Science Citation Index. The document representation relies on a citation-based model that 
depicts the inter-document similarities using the bibliographic coupling counts between all pairs of 
documents in the collection. For the given document collections, it is rather easy to judge the qual- 
ity of the clustering result. Although the resulting SOM maps are graphical artifacts, the simulation 
results have demonstrated that the approach is highly effective in producing fairly detailed and objec- 
tive visualizations that are easy to understand. These maps, therefore, have the potential of providing 
insights into the information hidden in a large collection of documents. 
19.1 Introduction


Fuzzy models have become extremely popular in recent years [DHR96,P93,P01,CL00,YF94]. They
are widely used as models of systems or as nonlinear controllers. Because of their nonlinear
characteristics, they are especially suitable for modeling or controlling complex, ill-defined dynamic
processes. The application of fuzzy systems has several advantages. For example, they allow a controller
to be designed for processes for which no mathematical model exists. Unlike the classical methodology,
which requires the existence of a model, a fuzzy system can be designed using only information
on the process behavior.

In this chapter, the relations between classical linear PI/PID controllers and fuzzy-logic-based
controllers, as well as an overview of different fuzzy models, are presented. Due to the limited length of
the chapter, only the major properties of the considered systems are described, and only selected fuzzy
systems (in the authors' opinion, the most important ones) are presented.


19.2 Fuzzy versus Classical Control 


According to [AH01], more than 90% of industrial processes are still controlled by means of
classical PI/PID controllers. They are commonly applied in industry due to the following factors:


• The classical methods are commonly known and well understood by working engineers.

• There are many tuning methods for the linear controller, which can be easily implemented for
a variety of industrial plants.

• The stability analysis of systems with linear controllers is much simpler than for plants
with nonlinear controllers.

• The area of fuzzy modeling and control in industry lacks specialists.


The output of the PID controller is the sum of the signals from three paths: proportional (P), integrating (I), and
differentiating (D). The relationship between the controller output and input is described by the following equation:

$$u = K_P e_P + K_I e_I + K_D e_D \tag{19.1}$$
FIGURE 19.2 Direct (a) and the incremental form (b) of the fuzzy PID controller. 


By setting the coefficient $K_D$ to zero, the PI controller is obtained, and by setting $K_I$ to zero, the
PD controller is obtained. The ideal PID controller is presented in Figure 19.1. In real-time applications,
output saturation as well as a selected anti-windup strategy have to be implemented. In addition,
a low-pass filter should be used in the differentiating path in order to reduce the effect of
high-frequency noise.
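
A minimal sketch of how these practical measures might look in a discrete-time implementation is given below. It is not taken from the chapter: the sampling period `dt`, the filter time constant `tau_d`, the rectangular integration rule, and the conditional-integration (clamping) anti-windup scheme are assumptions chosen for illustration, and other anti-windup strategies are equally valid.

```python
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    dt: float             # sampling period in seconds (assumed)
    u_min: float          # lower output saturation limit
    u_max: float          # upper output saturation limit
    tau_d: float = 0.01   # derivative filter time constant (assumed)
    integral: float = 0.0
    d_filt: float = 0.0
    e_prev: float = 0.0

    def step(self, e: float) -> float:
        # Differentiating path: finite difference followed by a
        # first-order low-pass filter to attenuate high-frequency noise.
        d_raw = (e - self.e_prev) / self.dt
        a = self.dt / (self.tau_d + self.dt)
        self.d_filt += a * (d_raw - self.d_filt)
        self.e_prev = e

        # Integrating path: tentative update, accepted only if allowed below.
        integral_new = self.integral + e * self.dt

        u = self.kp * e + self.ki * integral_new + self.kd * self.d_filt
        u_sat = min(max(u, self.u_min), self.u_max)

        # Anti-windup by conditional integration: keep integrating only when
        # the output is not saturated or the error drives it out of saturation.
        if u == u_sat or (u - u_sat) * e < 0:
            self.integral = integral_new
        return u_sat
```

With `kd = 0` the sketch reduces to a PI controller, and with `ki = 0` to a PD controller, mirroring the special cases of Equation 19.1.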

Based on the structure presented in Figure 19.1, it is straightforward to derive the fuzzy representation
of the PID controller. Two general structures of fuzzy PID controllers are presented in Figure 19.2. The form
directly related to Figure 19.1 is presented in Figure 19.2a, whereas the incremental form is shown in
Figure 19.2b. The incremental form is more commonly applied for the fuzzy PI controller
because it makes limiting the output signal simple.

The fuzzy PID controller relates the controller output to its inputs by the following relationship
for the direct form:

$$u = f(e_P, e_I, e_D) \tag{19.2}$$

and for the incremental form:

$$\Delta u = f(e_P, e_D, e_{DD}) \tag{19.3}$$


Unlike the classical PID controller, which realizes a linear addition of the incoming signals,
the fuzzy controller transforms the incoming signals to the output using a nonlinear fuzzy
relationship. This transformation can be regarded as a nonlinear addition. In order to illustrate
this feature, hypothetical control surfaces for the classical PI and fuzzy PI controllers are shown
in Figure 19.3.

As can be seen from Figure 19.3, the control surface of the classical system is linear; only the slope
of this surface can be changed by modifying the controller coefficients. In contrast to the classical
controller, the fuzzy system can in principle realize an arbitrary nonlinear control surface. This demonstrates
the power of fuzzy control, which makes it suitable for nonlinear plants. It should also be highlighted that
FIGURE 19.3 Hypothetical control surface of the classical (a) and fuzzy (b) PI controllers.


the control surface of the fuzzy controller can also be linear, so the performance of the controlled object
with a fuzzy controller can be at least as good as with the classical controller.
It is widely accepted that fuzzy controllers can be applied in the following situations:


• When there is no mathematical model of the plant in the form of differential or difference
equations.

• When, due to nonlinearities, the application of the classical methods is impossible (or not effective).

• When the aim of the control is given in a vague way (e.g., a term such as smoothly is used).


The application of fuzzy control to linear plants is sometimes questioned by scientists. Whether it
can be justified depends on the form of the control index used, which can be represented
in the following general form:


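$$K = \min\left(\alpha \int_0^{\infty} e^2\,dt + (1-\alpha)\int_0^{\infty} u^2\,dt\right) \tag{19.4}$$


where
$\alpha$ is a weight factor
$t$ is time
$e$ is the control error
$u$ is the control signal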

Even if the controller and the object are linear, the presented control index is nonlinear, because it is
quadratic in the control error and the control signal. Due to the form of the control index, the control
problem therefore becomes nonlinear. The nonlinear fuzzy controller offers many possibilities to minimize
the nonlinear control index. Therefore, the application of the fuzzy controller to linear objects is justified.
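
The following sketch evaluates the bracketed expression of Equation 19.4 for one recorded response, so that candidate controllers can be compared. The finite horizon, the rectangular integration rule, and the function name `control_index` are assumptions made for illustration; the minimization over controller parameters is not shown.

```python
def control_index(e, u, dt, alpha):
    """Finite-horizon approximation of
    alpha * int(e^2 dt) + (1 - alpha) * int(u^2 dt)
    for sampled error e[k] and control signal u[k]."""
    int_e2 = sum(x * x for x in e) * dt
    int_u2 = sum(x * x for x in u) * dt
    return alpha * int_e2 + (1.0 - alpha) * int_u2


# Doubling both signals quadruples the index, which shows that the
# index is a nonlinear (quadratic) function of the signals.
k1 = control_index([1.0, 1.0], [2.0, 2.0], dt=0.5, alpha=0.5)
k2 = control_index([2.0, 2.0], [4.0, 4.0], dt=0.5, alpha=0.5)
```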

The nonlinear fuzzy model (Figure 19.2) relies on the nonlinear relationship between the inputs and outputs
of the fuzzy controller. Different systems can be found in the literature, so the next section first describes
the general structure of the fuzzy system and then provides an overview of different fuzzy models.


19.3 Fuzzy Models 
19.3.1 General Structure of Fuzzy Models 


The issues of system modeling have been widely investigated in a vast number of papers and books.
Reliable models that describe the input-output relationships accurately enough are sought
after. This problem is especially important in fields such as monitoring, fault detection, and control
[DHR96,P93,P01]. A number of classical systems have been proposed in the literature. However, due to
system nonlinearity and complexity, the application of the classical methods is not always preferable.
FIGURE 19.4 Basic structure of the fuzzy system.


In contrast, fuzzy models possess the flexibility and intuition of human reasoning; therefore,
they are intended for modeling nonlinear, ill-defined processes.

In the literature, different types of fuzzy models can be found, varying in membership functions,
inference methods, paths of the flowing signals, etc. At first glance, they can appear totally dissimilar,
for example, when comparing the Mamdani and wavelet neuro-fuzzy systems. Nonetheless, all of them
have the common basic structure shown in Figure 19.4. This structure consists of four basic blocks:
fuzzyfication, rule base, fuzzy inference engine, and defuzzyfication [DHR96,P93,P01,CL00,YF94].

In the fuzzyfication block, the incoming sharp values are converted to fuzzy values. In order to
conduct this operation, the input membership functions of the system have to be defined unambiguously.
The shape of the membership functions influences the obtained fuzzy model. Due to their low
computational effort, triangular or trapezoidal functions are widely applied. In the ranges where
the system should react very sensitively to the input values, the membership functions should be narrow,
which allows different values to be distinguished more accurately. However, it should be pointed out that the
number of rules grows rapidly with the number of membership functions. The output of the fuzzyfication
block is a vector $X_f = [\mu_{A_1}, \ldots, \mu_{A_n}]$ describing the degrees of membership of the
input signals in the related membership functions. A hypothetical fuzzyfication process for one input
value $x_1 = 3$ and two membership functions $A_1$ and $A_2$ is presented in Figure 19.5.
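
The fuzzyfication step can be sketched as follows. The triangular sets and their parameters are hypothetical (the values used in Figure 19.5 are not given in the text), and the names `tri`, `fuzzify`, and `X_f` are chosen here for illustration only.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical input sets: A1 peaks at 2, A2 peaks at 4.
A1 = (0.0, 2.0, 4.0)
A2 = (2.0, 4.0, 6.0)


def fuzzify(x, sets):
    """Return the vector of membership degrees of the sharp value x."""
    return [tri(x, *s) for s in sets]


X_f = fuzzify(3.0, [A1, A2])   # membership degrees of x1 = 3 in A1 and A2
```

With these assumed sets, the input value 3 lies halfway between the two peaks and therefore belongs to both sets with degree 0.5.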

The next block of the system is called the fuzzy inference engine. On the basis of the input vector $X_f$, it
calculates the resulting membership function, which is passed to its output. The separate block called the
rule base (Figure 19.4) contains a set of if-then rules, which exert the most important influence on
the fuzzy model. The fuzzy inference engine combines the knowledge incorporated in the if-then rules using
fuzzy approximate reasoning.

In the first step, the fuzzy inference engine calculates the degree of fulfillment of the rule premises on
the basis of $X_f$. The higher the degree, the more influence the rule has on the output. Depending on the
form of the rule, a different operation is used.

For the AND premise, it is the t-norm; for the OR premise, it is the s-norm. It should be pointed out
that different types of t-norms, such as min, prod, Hamacher, and Einstein, and various types of
s-norms, such as max, sum, Hamacher, and Einstein, are cited in the literature. In Figure 19.6, the way
of calculating the AND and OR premises is shown.
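
For reference, the operators named above can be written in a few lines each. The formulas below are the usual textbook definitions and are not quoted from the chapter; in particular, interpreting "sum" as the probabilistic (algebraic) sum is an assumption, since the bounded sum is also used under that name.

```python
# t-norms (AND)
def t_min(a, b):
    return min(a, b)

def t_prod(a, b):
    return a * b

def t_hamacher(a, b):
    return 0.0 if a == b == 0 else a * b / (a + b - a * b)

def t_einstein(a, b):
    return a * b / (2.0 - (a + b - a * b))

# s-norms (OR)
def s_max(a, b):
    return max(a, b)

def s_prob_sum(a, b):
    return a + b - a * b

def s_hamacher(a, b):
    return 1.0 if a * b == 1 else (a + b - 2 * a * b) / (1 - a * b)

def s_einstein(a, b):
    return (a + b) / (1 + a * b)
```

All t-norms satisfy T(a, 1) = a and all s-norms satisfy S(a, 0) = a, which is a quick sanity check for any implementation.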




FIGURE 19.5 Hypothetical fuzzyfication process. 
FIGURE 19.6 Hypothetical calculation of the premises using AND and OR operations.


Then the shape of output membership functions of each rule is determined. This operation is called 
implication and is conducted with the help of a fuzzy implication method. Two commonly used implica- 
tion methods, Mamdani and prod, are shown in Figure 19.7. 

As a result of the implication process, a specific number of conclusion membership functions are
obtained. Thereafter, the aggregation process is carried out: the output membership functions are
combined into one resulting fuzzy set using the s-norm operation. A hypothetical aggregation process with
two output membership functions and the max s-norm is presented in Figure 19.8.
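
A possible numerical realization of implication and aggregation on a sampled output universe is sketched below. The output sets `B1` and `B2`, the firing degrees 0.5 and 0.8, and the sampling grid are hypothetical values chosen for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


ys = [i / 10 for i in range(101)]            # output universe 0..10
B1 = [tri(y, 0.0, 3.0, 6.0) for y in ys]     # hypothetical output sets
B2 = [tri(y, 4.0, 7.0, 10.0) for y in ys]


def implicate(mu, B, method="min"):
    """Mamdani (min) implication clips the set, prod implication scales it."""
    if method == "min":
        return [min(mu, b) for b in B]
    return [mu * b for b in B]


def aggregate(*sets):
    """Max s-norm aggregation of the conclusion sets, point by point."""
    return [max(values) for values in zip(*sets)]


out = aggregate(implicate(0.5, B1), implicate(0.8, B2, "prod"))
```

The two implication methods are mixed here only to show both in one example; a real system would normally use the same method for every rule.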




FIGURE 19.7 The idea of the Mamdani and prod implication methods. 


FIGURE 19.8 Hypothetical aggregation process using two fuzzy functions and max s-norm operation. 
FIGURE 19.9 (a) The first maximum and (b) the mean of maximum defuzzyfication methods.


The last block of the fuzzy system performs the defuzzyfication procedure. Its goal is to compute the
sharp value from the resulting fuzzy set. This value is passed to the output of the whole fuzzy system.
Among the various defuzzyfication methods, the following are worth mentioning:


• The first (last) maximum
• The mean of maximum
• The center of gravity
• The height (singleton)


In the first (last) maximum defuzzyfication method, the position of the first (the last) maximum
of the resulting fuzzy set is taken as the output value. In the mean of maximum method, the output value
is selected as the midpoint between the first and the last maximum values. The two
described methods are illustrated in Figure 19.9.

The main advantage of the presented methods is their computational simplicity. However, they also
have very serious drawbacks. The output sharp value is influenced only by the most activated
fuzzy set, which means that the fuzzy system is insensitive to changes in the other sets. Additionally,
when these methods are used, the output of the system can change abruptly; for this reason, they
are very rarely applied in real systems.
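
The maximum-based methods can be sketched on a sampled fuzzy set as follows. The clipped triangular set used as input and the tolerance for detecting the plateau are assumptions for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def maxima(ys, mu, tol=1e-12):
    """Positions of the first and the last maximum of a sampled fuzzy set."""
    peak = max(mu)
    idx = [i for i, m in enumerate(mu) if abs(m - peak) < tol]
    return ys[idx[0]], ys[idx[-1]]


def first_of_max(ys, mu):
    return maxima(ys, mu)[0]


def last_of_max(ys, mu):
    return maxima(ys, mu)[1]


def mean_of_max(ys, mu):
    first, last = maxima(ys, mu)
    return 0.5 * (first + last)


ys = [i / 10 for i in range(101)]
clipped = [min(0.5, tri(y, 0.0, 3.0, 6.0)) for y in ys]   # plateau on [1.5, 4.5]
```

For this clipped set the three methods return the left edge, the right edge, and the midpoint of the plateau, respectively; nothing outside the plateau influences the result, which is the insensitivity noted above.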

The center of gravity method is a commonly used defuzzyfication procedure. The output sharp value
is the center of gravity of the output fuzzy set. All the activated fuzzy membership functions
(related to specific rules) affect the system output, which is the advantage of this method. This method
also has significant drawbacks, though. One commonly cited disadvantage is its computational
complexity, resulting from the integration of an irregular surface. Still, it should be mentioned that this
problem can be solved by off-line calculation of the resulting output for all combinations of the inputs.
Those values are then stored in processor memory and are quickly accessible. The other drawbacks are
as follows. Firstly, when only one rule is activated, the system output does not depend on its
firing degree. This situation is presented in Figure 19.10.
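
A short numerical check of this insensitivity is sketched below. The integral is approximated by a sum over a sampled universe, and a symmetric triangular output set is assumed; both choices are illustrative.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def centroid(ys, mu):
    """Center of gravity of a sampled fuzzy set (discrete approximation)."""
    den = sum(mu)
    if den == 0:
        raise ValueError("no rule is activated")
    return sum(y * m for y, m in zip(ys, mu)) / den


ys = [i / 10 for i in range(101)]
B = [tri(y, 0.0, 3.0, 6.0) for y in ys]     # single symmetric output set

# One activated rule, two very different firing degrees.
y_weak = centroid(ys, [min(0.2, b) for b in B])
y_strong = centroid(ys, [min(0.9, b) for b in B])
```

Both firing degrees yield the same output (the center of the set), so the controller cannot distinguish a weakly fired rule from a strongly fired one in this situation.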

The center of gravity method requires the application of adjacent fuzzy sets with a similar weight.
Otherwise, the activation level of a narrow set cannot influence the system output significantly. This can
be desirable in some applications, yet in most cases it is not preferable. A fuzzy system with membership
functions of different weights is shown in Figure 19.11.

As can be concluded from Figure 19.11, a change in the firing degree of the narrow fuzzy set influences the
resulting output $y_0$ of the system only insubstantially.

This defuzzification method can have a narrow output range. Even if only the first (the last) 
fuzzy set is fully activated, the output of the system does not reach the minimum (maximum) value 




FIGURE 19.10 Insensitivity of the system output in the case of one activated set. 
FIGURE 19.11 Output of the fuzzy system with membership functions with different weights. 
FIGURE 19.12 (a) Narrowing of the output range and (b) the way of its elimination.


(Figure 19.12a). This problem is solved by extending the fuzzy sets to the regions outside the universe
of discourse (Figure 19.12b). Still, it should be ensured that the system cannot yield an output
outside the range of the permitted output values.

One of the most popular defuzzyfication methods applied in real systems, especially in time-critical
applications, is the height (singleton) defuzzyfication method, which relies on the replacement
of the output fuzzy sets by singleton values. The resulting output is then calculated according to
the following equation:


$$y_0 = \frac{\sum_{j=1}^{m} \mu_j\, y_j}{\sum_{j=1}^{m} \mu_j} \tag{19.5}$$


where
$y_j$ is the value of the suitable singleton
$\mu_j$ is the value of the antecedent part of the suitable rule


The major advantage of this method is its computational simplicity, resulting from the replacement
of the integration of the irregular shape of the set (in the previous method) by sum and product operations.
Sensitivity and continuity are its further strong points.
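
Equation 19.5 translates directly into code. The function name `height_defuzz` and the example values are chosen here for illustration.

```python
def height_defuzz(mu, y):
    """Height (singleton) defuzzyfication: weighted average of the singleton
    positions y[j], weighted by the firing degrees mu[j] of the rules."""
    den = sum(mu)
    if den == 0:
        raise ValueError("no rule is activated")
    return sum(m * s for m, s in zip(mu, y)) / den


# Two rules with firing degrees 0.25 and 0.75 and singletons at 2 and 6.
y0 = height_defuzz([0.25, 0.75], [2.0, 6.0])
```

Only sums and products are needed, and the output moves continuously between the singleton positions as the firing degrees change.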


19.3.2 Mamdani (Mamdani-Assilian) Model


The first fuzzy system to be introduced is called the Mamdani (in some papers Mamdani-Assilian) fuzzy
model. It was applied to control a real steam engine system in 1975 [AM74].
This model consists of several if-then rules in the following form:


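$$\begin{aligned}
R_1&: \text{IF } x_1 = A_{11} \text{ AND } x_2 = A_{12} \text{ AND} \ldots \text{AND } x_j = A_{1j} \text{ THEN } y = B_1 \\
&\;\;\vdots \\
R_n&: \text{IF } x_1 = A_{n1} \text{ AND } x_2 = A_{n2} \text{ AND} \ldots \text{AND } x_j = A_{nj} \text{ THEN } y = B_n
\end{aligned} \tag{19.6}$$
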
In its original form, it used the following operators: min as the t-norm, Mamdani implication, max
aggregation, and center of gravity defuzzyfication. Nowadays, Mamdani systems employing different
operators are also cited in the literature. The Mamdani system is very popular in various applications, from
simulation studies to real-time systems [AM74].
FIGURE 19.13 Illustration of the Mamdani system computation scheme.


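The calculation of the output value of a hypothetical Mamdani system is presented in Figure 19.13. The
system consists of two rules in the following form:

$$\begin{aligned}
R_1&: \text{IF } x_1 = A_{11} \text{ AND } x_2 = A_{12} \text{ THEN } y = B_1 \\
R_2&: \text{IF } x_1 = A_{21} \text{ AND } x_2 = A_{22} \text{ THEN } y = B_2
\end{aligned} \tag{19.7}$$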


Initially, the fuzzyfication of the two input signals is conducted. Then the activation degree
of the premises is determined using the min operation, and the shape of the conclusion membership
functions is obtained through the Mamdani implication method. Afterward, the resulting fuzzy set is
determined using the max aggregation procedure. Finally, the output of the whole fuzzy system is
computed with the center of gravity defuzzyfication method.
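
The complete chain for a two-rule system of this kind can be sketched as follows. All membership function parameters are hypothetical (the figure values are not given in the text), and the center of gravity is approximated by a sum over a sampled output universe.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def mamdani(x1, x2):
    # Fuzzyfication and min t-norm: firing degree of each rule premise.
    w1 = min(tri(x1, 0.0, 2.0, 4.0), tri(x2, 0.0, 2.0, 4.0))   # rule R1
    w2 = min(tri(x1, 2.0, 4.0, 6.0), tri(x2, 2.0, 4.0, 6.0))   # rule R2

    # Mamdani (min) implication and max aggregation on a sampled universe.
    ys = [i / 10 for i in range(101)]
    agg = [max(min(w1, tri(y, 0.0, 3.0, 6.0)),      # conclusion set B1
               min(w2, tri(y, 4.0, 7.0, 10.0)))     # conclusion set B2
           for y in ys]

    # Center of gravity defuzzyfication.
    den = sum(agg)
    if den == 0:
        return None   # no rule fired for this input
    return sum(y * m for y, m in zip(ys, agg)) / den


y0 = mamdani(3.0, 3.0)   # both rules fire with degree 0.5
```

When only the first rule fires, the output is the center of B1; when both fire equally, the output lies midway between the two conclusion sets.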

The Mamdani model consists of several fuzzy rules, each of which determines one fuzzy point in
the fuzzy surface. The set of fuzzy points forms the fuzzy graph, in which interpolation between
points depends on the operators used in the fuzzy model. The fuzzy surface possesses specific
properties if triangular input membership functions are applied and the conclusion fuzzy sets are
replaced with singleton values. This is illustrated in Figure 19.14, where the system with 16 rules
and the parameters presented in Figure 19.15 is considered.


FIGURE 19.14 Division of the fuzzy surface to separate sectors.

FIGURE 19.15 Rule base of the system presented in Figure 19.14.


Every fuzzy rule defines the fuzzy surface around the fuzzy point with suitable coordinates. For
instance, the rule:



R7: IF x, = Aj; AND x, = Ay, THEN y = B,; 


defines the neighborhood of the point with the following coordinates (a,,, a,,). The control surface of the 
analyzed fuzzy system consists of nine sectors. The support points are determined by the singleton of
each rule. Changing a specific singleton value changes the slope of the four neighboring sectors.
It should be noted that this change does not influence the remaining sectors; that is, the modification of
a singleton value acts only locally. A modification of a selected input membership function affects
each rule of this set. For instance, a change of the A,, set moves the following support points, a,,-a,,, 
A19—Ay5 Ay,—Ay3, Ay,—A,,. Therefore, this modification has a global character because it changes the whole 
cross section. 


19.3.3 Takagi-Sugeno Model 


The next well-known system is called the Takagi-Sugeno-Kang (TSK) model. It was proposed in 1985 by
Takagi and Sugeno [TS85] and later, in 1988, by Sugeno and Kang [SK88]. Nowadays, it is one of the most
frequently applied fuzzy systems [EG03,ELNLN02]. It consists of several rules in the following form:


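$$\begin{aligned}
R_1&: \text{IF } x_1 = A_{11} \text{ AND } x_2 = A_{12} \text{ AND} \ldots \text{AND } x_j = A_{1j} \text{ THEN } y = f(x_1, x_2, \ldots, x_j, x_0) \\
&\;\;\vdots \\
R_n&: \text{IF } x_1 = A_{n1} \text{ AND } x_2 = A_{n2} \text{ AND} \ldots \text{AND } x_j = A_{nj} \text{ THEN } y = f(x_1, x_2, \ldots, x_j, x_0)
\end{aligned} \tag{19.8}$$


where $x_0$ is a constant value.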

The difference between the Mamdani and TSK models is evident in the conclusion part of the rule.
In the Mamdani system, it is represented by a fuzzy set, while in the TSK system it is a function of the
input variables and a constant value.

The output of the TSK model is calculated by means of the following equations: 


(19.9) 


An illustration of the TSK system computational scheme is presented in Figure 19.16. After the
fuzzyfication of the two input values $x_1$ and $x_2$, the firing degree of the premise part of each rule is
computed using the min operator as the t-norm. Unlike in the Mamdani system, the consequent part
of the rule is a function of the input variables. After its calculation, the implication and aggregation
methods are applied. The output of the system is then computed by means of the singleton
defuzzyfication strategy.
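
A sketch of this computation is given below, assuming linear consequent functions and the usual weighted-average (singleton) form of the TSK output. The membership function parameters and the consequent coefficients are hypothetical values chosen for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def tsk(x1, x2, rules):
    """rules: list of ((set_for_x1, set_for_x2), (a1, a2, x0)), where each
    consequent is the linear function y = a1*x1 + a2*x2 + x0."""
    num = den = 0.0
    for (s1, s2), (a1, a2, x0) in rules:
        w = min(tri(x1, *s1), tri(x2, *s2))      # premise, min t-norm
        num += w * (a1 * x1 + a2 * x2 + x0)      # singleton position times weight
        den += w
    return num / den if den else None


RULES = [
    (((0.0, 2.0, 4.0), (0.0, 2.0, 4.0)), (1.0, 0.5, 0.0)),   # rule R1
    (((2.0, 4.0, 6.0), (2.0, 4.0, 6.0)), (0.0, 2.0, 1.0)),   # rule R2
]

y0 = tsk(3.0, 3.0, RULES)
```

At the input (3, 3) both rules fire with degree 0.5, so the output is the average of the two consequent values 4.5 and 7.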
IF $x_1 = A_{11}$ AND $x_2 = A_{12}$ THEN $y = x_1 a_{11} + x_2 a_{12} + x_0$
IF $x_1 = A_{21}$ AND $x_2 = A_{22}$ THEN $y = x_1 a_{21} + x_2 a_{22} + x_0$


FIGURE 19.16 Illustration of the TSK system computation scheme.




FIGURE 19.17 Hypothetical division of the model surface in TSK system. 


Compared to the Mamdani system, the TSK model has the following advantages. First of all, it
reduces the computational complexity of the whole system. This stems from the fact that the
integration of the nonlinear surface is replaced by sum and prod operations (in the defuzzyfication
step). Furthermore, by suitable selection of the input membership functions, it is possible to
obtain sectors of the control surface that depend on only one rule, which simplifies optimization of
the fuzzy system. For this reason, the TSK model is often called a quasi-linear fuzzy model. A schematic
division of the model surface in a system with two inputs and trapezoidal input membership
functions is presented in Figure 19.17.

As can be concluded from the figure, the use of the trapezoid membership functions allows obtain- 
ing the model surface with the sectors relying on one rule. Those sectors are marked with f,-f, in Figure 
19.17. In the shaded region, the system output is determined by two or four rules. 


19.3.4 Tsukamoto Model 


The Tsukamoto model was proposed in 1979 [T79]. The main difference between the TSK and
Tsukamoto models lies in the conclusion part of the rule. In the TSK system, the position of the singletons
FIGURE 19.18 Illustration of the Tsukamoto system computation scheme.


is a function of the input signals, and their amplitude depends on the firing degree of the rule premises.
In the Tsukamoto system, on the other hand, both the position and the amplitude of the
output singletons are functions of the degree of activation of each rule. The Tsukamoto system
computational scheme is shown in Figure 19.18.

The Tsukamoto system has monotonic functions in the consequent part of the rule. The firing level
of each rule defines the position and amplitude of the output singletons. The output of
the Tsukamoto model is then calculated similarly to the TSK model. Nevertheless, the Tsukamoto model
is very rarely applied due to its complexity and the difficulty of identifying the functions in the consequent
part of the rules.
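
The idea can be sketched as follows, assuming linear monotonic consequent membership functions so that they can be inverted in closed form. The helper names, the ranges, and the firing degrees are hypothetical.

```python
def inv_increasing(w, lo, hi):
    """Inverse of the increasing set B(y) = (y - lo) / (hi - lo) on [lo, hi]."""
    return lo + w * (hi - lo)


def inv_decreasing(w, lo, hi):
    """Inverse of the decreasing set B(y) = (hi - y) / (hi - lo) on [lo, hi]."""
    return hi - w * (hi - lo)


def tsukamoto(firing, inverses):
    """The firing degree w of each rule fixes both the position of its output
    singleton (through the inverse of the monotonic consequent function) and
    its amplitude; the outputs are then combined as in the TSK model."""
    positions = [inv(w) for w, inv in zip(firing, inverses)]
    den = sum(firing)
    if den == 0:
        raise ValueError("no rule is activated")
    return sum(w * y for w, y in zip(firing, positions)) / den


y0 = tsukamoto(
    [0.25, 0.75],
    [lambda w: inv_decreasing(w, 0.0, 4.0),    # rule 1 singleton at 3.0
     lambda w: inv_increasing(w, 4.0, 8.0)],   # rule 2 singleton at 7.0
)
```

The need to specify an invertible monotonic function for every rule illustrates the identification difficulty mentioned above.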


19.3.5 Models with Parametric Consequents 


The system with parametric conclusions was proposed in 1997 by Leski and Czogala [LC99]. In this
system, the parameters of the fuzzy sets in the conclusions are functions of the input variables. The
considered parameters are as follows: location, width or core, height of the input membership functions,
and others. This system is also called the model with moving consequents. Figure 19.19 presents an
example of fuzzy reasoning for a system with parametric conclusions and two rules.

The premises of the individual rules are computed in the same way as in the Mamdani or TSK
system. The firing degrees of the premises are used to calculate the shape of the conclusion membership




FIGURE 19.19 Process of reasoning in the system with parametric conclusions. 
functions. In the discussed case, the Reichenbach implication is used. The fuzzy sets in the conclusions are
divided into an informative and a non-informative part. The output value of the fuzzy system is obtained
by applying a selected aggregation and defuzzyfication operation. A drawback of the presented system is the
difficulty of obtaining rules with parametric conclusions from a human expert. The authors recommend
utilizing this system for automatic rule extraction on the basis of measurement data.


19.3.6 Models Based on Sets of the II-Type Fuzzy Sets 


Fuzzy systems of the II type were put forward by Mendel in 1999 [KMQ99]. They were introduced in
response to the contradiction between the uncertainty of information and the exact definition of the values
of classical fuzzy sets [KMQ99,LCCS09]. Data acquired from different operators can vary; hence, the
information is not determined unambiguously. Similarly, in systems that generate rules automatically on
the basis of measurement data, disturbances have to be accounted for. In such situations, classical sets
(the so-called type I sets) do not prove useful due to their precisely defined shape. Type II Gaussian sets
with various types of uncertainty of information are displayed in Figure 19.20. The shape and width of
the uncertainty area depend on the considered application.

II-type fuzzy sets can appear both in the premises and in the conclusions of the fuzzy rules. In order to
simplify the computational algorithm, most authors apply II-type fuzzy sets only in the premises of the rules,
while the conclusions can be of the Mamdani, TSK, or another type. The scheme of the output computation
in a system with II-type input membership functions is presented in Figure 19.21.




FIGURE 19.20 Gauss II-type fuzzy sets with uncertainty of (a) modal value and (b) weight value. 


FIGURE 19.21 Illustration of the system computation scheme with the II-type input fuzzy sets.

The consequent parts of the rules have the TSK form. The fuzzyfication of the input variables is carried
out for the lower and upper ranges of the membership function, so the firing of the premises is
represented by two separate values: the lower and the upper degree. Next, the implication and aggregation
operations are carried out for the upper and lower values separately. The additional block in this fuzzy
system is called the type reduction. The single sharp value $y_0$ is calculated on the basis of the two values
through the following equation (among others):


$$y_0 = \frac{\underline{y}_0 + \overline{y}_0}{2} \tag{19.10}$$


where $\underline{y}_0$ and $\overline{y}_0$ denote the sharp values determined for the lower and upper values
of the II-type membership function.
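
A simplified single-input sketch of this scheme is given below. It assumes Gaussian input sets with an uncertain modal value in the interval [m1, m2], linear TSK consequents, and the averaging type reduction of Equation 19.10; all names and numerical values are hypothetical, and more elaborate type-reduction procedures exist.

```python
import math


def gauss(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2)


def it2_gauss(x, m1, m2, s):
    """Lower and upper membership degrees of a Gaussian II-type set whose
    modal value is only known to lie in [m1, m2]."""
    lower = min(gauss(x, m1, s), gauss(x, m2, s))
    upper = 1.0 if m1 <= x <= m2 else max(gauss(x, m1, s), gauss(x, m2, s))
    return lower, upper


def type2_tsk(x, rules):
    """rules: list of ((m1, m2, s), (a, x0)) with consequent y = a*x + x0."""
    num_l = den_l = num_u = den_u = 0.0
    for (m1, m2, s), (a, x0) in rules:
        lo, up = it2_gauss(x, m1, m2, s)
        f = a * x + x0
        num_l += lo * f
        den_l += lo
        num_u += up * f
        den_u += up
    y_lower = num_l / den_l          # output for lower membership functions
    y_upper = num_u / den_u          # output for upper membership functions
    return 0.5 * (y_lower + y_upper)  # type reduction, Equation 19.10


RULES = [((1.0, 1.5, 1.0), (1.0, 0.0)),
         ((3.0, 3.5, 1.0), (0.0, 5.0))]
y0 = type2_tsk(2.0, RULES)
```

When the uncertainty interval collapses (m1 = m2), the lower and upper outputs coincide and the sketch reduces to an ordinary TSK system.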


19.3.7 Neuro-Fuzzy Models 


Optimization of the fuzzy system is a very important issue. A variety of methods, such as clustering and
evolutionary strategies, can be used for this purpose. One of the methods relies on the transformation
of the classical system into a neuro-fuzzy structure and the use of one of the classical neural
network training methods (e.g., the back propagation algorithm) [P01,CL00]. The types of neuro-fuzzy
systems found in the literature include classical TSK, wavelet neuro-fuzzy systems, and others. They all
have the same general structure, presented in Figure 19.22.

In the neuro-fuzzy system, the following layers can be distinguished: 


Input layer. Each input node in this layer corresponds to a specific input variable ($x_1, x_2, \ldots, x_n$). These
nodes only pass input signals to the first layer.

Layer 1. Each node performs a membership function A, that can be referred to as the fuzzyfication 
procedure. 


Layer 2. Each node in this layer represents the precondition part of fuzzy rule and is denoted by T, which 
conducts a t-norm operation and sends the results out. 




FIGURE 19.22 General structure of neuro-fuzzy with two inputs, six fuzzy input membership functions, and 
nine rules. 
FIGURE 19.23 Hypothetical TSK neuro-fuzzy system with four rules.


Layer 3. In this layer, the function in the consequent part of the rule is calculated. It is a function of the 
input variables and/or a constant value. 


Layer 4. In this layer, the output membership functions are computed. It combines, using a selected 
inference method, the activation level of each premise with the value of the consequent function. 


Layer 5. This layer acts as a defuzzifier. The single node is denoted by Σ and sums all incoming 
signals. Then a selected defuzzification strategy is carried out. 


Output layer. This is solely the output of the neuro-fuzzy system. 
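
The layer-by-layer description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the structure of any particular system from the literature: the Gaussian membership shape, the product t-norm, the first-order consequents, and the names `gaussian` and `tsk_forward` are all chosen here for the example.

```python
import math


def gaussian(x, center, width):
    """Layer 1: Gaussian membership function (an assumed shape)."""
    return math.exp(-((x - center) / width) ** 2)


def tsk_forward(x1, x2, mfs1, mfs2, consequents):
    """Evaluate a two-input TSK neuro-fuzzy system layer by layer.

    mfs1, mfs2: lists of (center, width) pairs, one per membership function.
    consequents: dict mapping a rule index (i, j) to the coefficients
                 (a, b, c) of the consequent a*x1 + b*x2 + c.
    """
    # Layer 1: fuzzification of each input.
    mu1 = [gaussian(x1, c, w) for c, w in mfs1]
    mu2 = [gaussian(x2, c, w) for c, w in mfs2]

    numerator = 0.0
    denominator = 0.0
    for (i, j), (a, b, c) in consequents.items():
        # Layer 2: t-norm (product, by assumption) gives the firing strength.
        firing = mu1[i] * mu2[j]
        # Layer 3: consequent as a function of the inputs and a constant.
        f = a * x1 + b * x2 + c
        # Layer 4: combine the firing strength with the consequent value.
        numerator += firing * f
        denominator += firing
    # Layer 5: sum the incoming signals and defuzzify by weighted average.
    return numerator / denominator


# Three membership functions per input give nine rules, as in Figure 19.22.
mfs = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
constant_rules = {(i, j): (0.0, 0.0, 2.0) for i in range(3) for j in range(3)}
output = tsk_forward(0.3, -0.4, mfs, mfs, constant_rules)
```

With every consequent set to the same constant, the weighted average returns that constant for any input, which is a quick sanity check of the normalization in Layer 5.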


The wavelet neuro-fuzzy inference system is obtained by putting selected wavelet functions into Layer 3 
(the conclusion part of the rules). The rest of the structure remains the same. The II-type fuzzy inference 
system can be realized by putting the II-type membership functions into the first layer and/or the 
third layer. In this case, a block that realizes the type reduction of the system (from the II-type to 
the single value) has to be added to the presented structure. 

In Figure 19.23, a hypothetical TSK neuro-fuzzy system for the parameters shown in Figure 19.24 is 
presented. It is equivalent to the classical TSK system. The name neuro-fuzzy refers to a different way of 
presenting the system structure (Figure 19.16). 

From this figure, the dependence of the consequent parts of the rules on the input signals and the 
constant value can be clearly seen. In some adaptive systems, the sum of the premises is not calculated, 
and the sum of singletons multiplied by suitable firing values is given as the output value. 

As pointed out previously, the most important advantage of the neuro-fuzzy system is the 
possibility of using the training methods developed for classical neural networks. Examples of off-line 
and on-line training systems can be found in the literature. One of the most popular adaptive 
neuro-fuzzy systems is the Adaptive Neuro-Fuzzy Inference System (ANFIS) proposed by Jang in [J93]. 
Other types of adaptive neuro-fuzzy models include the Neural Fuzzy CONtroller (NEFCON) [NNK99] and the 
Artificial Neural Network Based Fuzzy Inference System (ANNBFIS) [LC99]. 


FIGURE 19.24 Input membership functions (labeled NB and PB) and rule base for the system presented in Figure 19.23. 


© 2011 by Taylor and Francis Group, LLC 




FIGURE 19.25 Recurrent neuro-fuzzy system with simple and hierarchical feedbacks. 


In contrast to the pure feed-forward architecture of the classical neuro-fuzzy system, the recurrent 
models have the advantage of using information from the past, which is especially useful for modeling 
and analyzing dynamic plants [ZM99,JC06,J02]. Recurrent neuro-fuzzy systems can be constructed 
in the same way as the standard feed-forward systems. However, due to their complexity, optimization of 
the system parameters is significantly more difficult. A neuro-fuzzy system with two types of recurrent 
feedback is presented in Figure 19.25. 

The first type consists of simple feedback units attached to the membership functions in the antecedent 
part of the rules. The second type consists of hierarchical feedback units, which connect the output of 
the whole system with the second input. 

Examples of significantly more complicated structures of recurrent neuro-fuzzy systems exist in the 
literature. An example of a system consisting of two subsystems is presented in Figure 19.26. 

At the beginning, the output of the first subsystem is calculated. Next, it is used as an additional 
feedback to the second subsystem. The delayed output of the second system is fed back to the first and, 
alternatively, also to the second subsystem. The recurrent feedbacks can be combined with every type 
of fuzzy system. For example, a II-type recurrent wavelet neuro-fuzzy system can be designed, although 
the difficulty of optimizing such a complicated structure should be pointed out. 


FIGURE 19.26 Recurrent neuro-fuzzy system consisting of two subsystems. 

19.3.8 Local and Global Models 


The first fuzzy models analyzed in the literature have a global character, that is, the whole universe of 
discourse is divided evenly by a grid. A real system, though, often has flat regions and very steep 
regions at the same time. In order to model the steep regions accurately, the grid lines have to be very 
dense. This increases the number of rules of the whole system, and the optimization of this 
type of system can be complicated [P01,ZS96]. The described situation is presented in Figure 19.27a. The 
displayed system has two “peaks.” In order to model the surface of these peaks with sufficient accuracy, 
the grid is quite dense and the number of rules for the whole system is as high as 144. In order to reduce 
the number of rules, the whole system is divided into four separate regions (Figure 19.27b). Two flat 
regions are modeled by 8 rules, two steep regions by 72 rules. So when the local models are used, a 
reduction of the number of fuzzy rules from 144 to 72 is achieved in this particular case. 

The desired feature of local fuzzy models is the continuity of the modeled surface at the points of 
contact of different models. Because the parameters of the local models are obtained separately, this 
condition is usually not fulfilled. Thus, the system presented in Figure 19.28 can be applied to ensure the 
continuity of the global model. 

The total output of the global model is calculated on the basis of the responses of the local models. 
These signals go to the aggregation block, which usually uses trapezium functions whose membership 
value changes in the areas of contact of the local models, and which calculates the output of the system. 
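
A minimal sketch of such an aggregation block is given below, assuming a one-dimensional input, two local models, and trapezium weights that overlap in the contact area. The region boundaries, the two local models, and the names `trapezoid` and `aggregate` are hypothetical and serve only to show how the blending restores continuity.

```python
def trapezoid(x, a, b, c, d):
    """Trapezium weight: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def aggregate(x, local_models):
    """Blend the local model outputs using their trapezium validity weights."""
    weights = [trapezoid(x, *region) for region, _ in local_models]
    outputs = [model(x) for _, model in local_models]
    return sum(w * y for w, y in zip(weights, outputs)) / sum(weights)


# Two hypothetical local models that disagree in the contact area 4 <= x <= 6.
local_models = [
    ((-1.0, 0.0, 4.0, 6.0), lambda x: 0.1 * x),          # flat region
    ((4.0, 6.0, 10.0, 11.0), lambda x: 2.0 * x - 8.0),   # steep region
]

blended_mid = aggregate(5.0, local_models)   # both models contribute equally
```

Outside the contact area only one weight is nonzero, so the global output equals the corresponding local model; inside it, the weights change linearly and the output moves smoothly from one model to the other.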
FIGURE 19.28 Aggregation block of the local fuzzy systems. 

19.4 Summary 


This chapter is devoted to the presentation of the properties of the fuzzy controller. Similarities and 
differences between the classical and the fuzzy PID controller are pointed out, and the fields of application 
of fuzzy systems are briefly described. Then the general structure of the fuzzy system is presented. Basic 
operations of the fuzzy models are described, and a survey emphasizing distinctions between different 
fuzzy models is put forward. Due to the limited length of the chapter, other very important issues, such as 
stability analysis or optimization of the fuzzy controller, are not considered here. Readers are referred 
to the variety of books on this topic, for example, [DHR96,P93,P01,CL00,YF94]. 

20.1 Introduction 


Conventional controllers, such as a PID controller, are broadly used for linear processes. In real life, 
most processes are nonlinear. Nonlinear control is considered as one of the most difficult challenges in 
modern control theory. While linear control system theory has been well developed, it is the nonlinear 
control problems that cause most headaches. Traditionally, a nonlinear process has to be linearized first 
before an automatic controller can be effectively applied [WB01]. This is typically achieved by adding 
a reverse nonlinear function to compensate for the nonlinear behavior so the overall process input- 
output relationship becomes somewhat linear. 

The issue becomes more complicated if the nonlinear characteristic of the system changes with time and 
there is a need for an adaptive change of the nonlinear behavior. These adaptive systems are best handled 
with methods of computational intelligence such as neural networks and fuzzy systems [W02,W07]. 

In this chapter, the neuro-fuzzy system [WJK99,W09], as a combination of fuzzy systems and neural 
networks, will be introduced and compared with classic fuzzy systems, based on a simple case (Figure 20.1). 

The case under study can be described as the nonlinear control surface shown in Figure 20.1. All points 
(441 points in Figure 20.1a and 36 points in Figure 20.1b) in the surface are calculated by the equation 

$$z = 1.1\exp\left(-0.07(x-5)^2 - 0.07(y-5)^2\right) - 0.9 \tag{20.1}$$
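
The two point sets can be generated with a short sketch such as the one below. The input range of 0 to 10 on both axes is an assumption made here (the surface is centered at x = y = 5, but the range is not stated in the text), and the names `target_surface` and `sample_grid` are illustrative.

```python
import math


def target_surface(x, y):
    """Required control surface of Equation 20.1."""
    return 1.1 * math.exp(-0.07 * (x - 5) ** 2 - 0.07 * (y - 5) ** 2) - 0.9


def sample_grid(points_per_axis, low=0.0, high=10.0):
    """Sample the surface on an evenly spaced square grid (range assumed)."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + k * step for k in range(points_per_axis)]
    return [[target_surface(x, y) for y in axis] for x in axis]


dense = sample_grid(21)    # 21 x 21 = 441 points, as in Figure 20.1a
coarse = sample_grid(6)    # 6 x 6 = 36 points, as in Figure 20.1b
```

The surface reaches its maximum of 1.1 - 0.9 = 0.2 at the center point x = y = 5 and approaches -0.9 far from it.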


20.2 Fuzzy System 


The most commonly used architectures for fuzzy system development are the Mamdani fuzzy system 
[M74,MW01] and the TSK (Takagi, Sugeno, and Kang) fuzzy system [TS85,SK88,WB99], as shown in 
Figure 20.2. Both of them consist of three blocks: fuzzification, fuzzy rules, and 
defuzzification/normalization. Each of the blocks could be designed differently. 


FIGURE 20.1 Required surface obtained from Equation 20.1: (a) 21 × 21 = 441 points and (b) 6 × 6 = 36 points. 

FIGURE 20.2 Block diagram of the two types of fuzzy systems: (a) Mamdani fuzzy system and (b) TSK fuzzy system. 


20.2.1 Fuzzification 


Fuzzification converts the analog inputs into sets of fuzzy variables. For each analog 
input, several fuzzy variables are generated with values between 0 and 1. The number of fuzzy variables 
depends on the number of membership functions in the fuzzification process. Various types of membership 
functions can be used for the conversion, such as triangular and trapezoidal. One may also consider using a 
combination of them; different types of membership functions result in different accuracies. Figure 
20.3 shows the surfaces and related accuracies obtained by using the Mamdani fuzzy system with different 
membership functions for solving the problem in Figure 20.1. 

One may notice that the triangular membership functions give a better surface than the 
trapezoidal membership functions. 
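
As an illustration of the fuzzification step, the sketch below defines triangular and trapezoidal membership functions and converts one analog input into five fuzzy variables. The even spacing of the functions, the input range of 0 to 10, and the function names are assumptions made for this example.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with its maximum of 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership function equal to 1 on the plateau [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def fuzzify(x, centers):
    """Convert one analog input into one fuzzy variable per membership function.

    Assumes evenly spaced triangular functions that cross at a value of 0.5.
    """
    spacing = centers[1] - centers[0]
    return [triangular(x, c - spacing, c, c + spacing) for c in centers]


# Five membership functions on an assumed input range of 0 to 10.
degrees = fuzzify(3.0, [0.0, 2.5, 5.0, 7.5, 10.0])
```

For the input 3.0, only the functions centered at 2.5 and 5.0 are active, with degrees of about 0.8 and 0.2; the remaining three fuzzy variables are zero.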

The more membership functions are used, the higher the accuracy that can be obtained. However, very dense 
functions may lead to frequent controller actions (known as “hunting”), and sometimes this may lead 
to system instability. In addition, more storage is required, because the size of the fuzzy table 
grows rapidly with the number of membership functions (as m^n entries for n inputs with m membership functions each). 


20.2.2 Fuzzy Rules 


Fuzzy variables are processed by fuzzy logic rules, with MIN and MAX operators. Fuzzy logic can 
be interpreted as an extended Boolean logic. For binary “0” and “1,” the MIN and MAX operators 
in fuzzy logic perform the same calculations as the AND and OR operators in Boolean logic, 
respectively (see Table 20.1); for fuzzy variables, the MIN and MAX operators work as shown in 
Table 20.2. 

FIGURE 20.3 Control surface using Mamdani fuzzy systems and five membership functions per input: (a) trapezoidal 
membership function, error = 3.8723 and (b) triangular membership function, error = 2.4799. 


TABLE 20.1 Binary Operation Using Boolean Logic and Fuzzy Logic 

A    B    A AND B    MIN(A,B)    A OR B    MAX(A,B) 
0    0    0          0           0         0 
0    1    0          0           1         1 
1    0    0          0           1         1 
1    1    1          1           1         1 


TABLE 20.2 Fuzzy Variables Operation Using Fuzzy Logic 

A      B      MIN(A,B)    MAX(A,B) 
0.3    0.5    0.3         0.5 
0.3    0.7    0.3         0.7 
0.6    0.4    0.4         0.6 
0.6    0.8    0.6         0.8 


20.2.3 Defuzzification 


As a result of “MAX of MIN” operations in the Mamdani fuzzy systems, a new set of fuzzy variables is 
generated, which later has to be converted to an analog output value by defuzzification blocks 
(Figure 20.2a). In the TSK fuzzy systems, the defuzzification block is replaced with normalization and 
weighted average; MAX operations are not required. Instead, a weighted average is applied directly to 
the regions selected by MIN operators. Figure 20.4 shows the resulting surfaces using the TSK fuzzy 
architecture with different membership functions. 
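
The MIN and MAX operators and the TSK weighted average can be sketched as follows. The rows reproduce Table 20.2; the helper names and the small two-by-two rule table in the usage lines are hypothetical and only illustrate the normalization step.

```python
def fuzzy_and(a, b):
    """MIN operator: extends Boolean AND to values between 0 and 1."""
    return min(a, b)


def fuzzy_or(a, b):
    """MAX operator: extends Boolean OR to values between 0 and 1."""
    return max(a, b)


def tsk_weighted_average(mu_x, mu_y, rule_values):
    """TSK output: MIN selects the regions, then a normalized weighted average.

    rule_values[i][j] is the value assigned to the region selected by
    membership function i of the first input and j of the second input.
    """
    numerator = 0.0
    denominator = 0.0
    for i, mx in enumerate(mu_x):
        for j, my in enumerate(mu_y):
            weight = fuzzy_and(mx, my)
            numerator += weight * rule_values[i][j]
            denominator += weight
    return numerator / denominator


# Rows (A, B) of Table 20.2 extended with MIN(A,B) and MAX(A,B).
rows = [(0.3, 0.5), (0.3, 0.7), (0.6, 0.4), (0.6, 0.8)]
table_20_2 = [(a, b, fuzzy_and(a, b), fuzzy_or(a, b)) for a, b in rows]

# Hypothetical rule table with two membership functions per input.
z = tsk_weighted_average([0.8, 0.2], [0.5, 0.5], [[1.0, 1.0], [3.0, 3.0]])
```

In the usage lines, the four MIN weights are 0.5, 0.5, 0.2, and 0.2, so the output is 2.2 / 1.4, which lies between the two rule values and closer to the more strongly activated one.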


20.3 Neuro-Fuzzy System 


A lot of research is devoted to improving the ability of fuzzy systems [WJ96,DGKW02,GNW08,MW01,OW99], 
using methods such as evolutionary strategies and neural networks [CW94]. The combination of fuzzy logic and neural 
networks is called a neuro-fuzzy system, which is supposed to result in a hybrid intelligent system by 
combining the human-like reasoning style of fuzzy systems with the structure of neural networks. 


20.3.1 Structure One 


Figure 20.5 shows a neuro-fuzzy system that attempts to present a fuzzy system in the form of a neural 
network [RH99]. 

FIGURE 20.4 Control surface using TSK fuzzy systems and five membership functions per input: (a) trapezoidal 
membership function, error = 2.4423, and (b) triangular membership function, error = 1.5119. 

FIGURE 20.5 Neuro-fuzzy system. 


The neuro-fuzzy system consists of four blocks: fuzzification, multiplication, summation, and 
division. The fuzzification block translates the input analog signals into fuzzy variables by membership 
functions. Then, instead of the MIN operations of classic fuzzy systems, product operations (signals are 
multiplied) are performed among fuzzy variables. This neuro-fuzzy system with product encoding is 
more difficult to implement [OW96], but it can generate a slightly smoother control surface (see Figures 
20.6 and 20.7). The summation and division layers perform the defuzzification. The weights on the 
upper sum unit are set to the expected values (both the Mamdani and TSK rules can be used), 
while the weights on the lower sum unit are all “1.” 

Figures 20.6 and 20.7 show the surfaces obtained using the neuro-fuzzy system in Figure 20.5, which 
are smoother than the surfaces in Figures 20.3 and 20.4. 

Note that, in this type of neuro-fuzzy system, only the architecture resembles neural networks, because 
the cells perform functions different from those of neurons, such as signal multiplication or division. 
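
The four blocks can be sketched end to end as shown below. This is an illustration under assumptions, not a reproduction of the experiments in Figures 20.6 and 20.7: the five triangular membership functions are placed evenly on an assumed range of 0 to 10, and the weights of the upper sum unit are taken as the values of Equation 20.1 at the membership function centers.

```python
import math


def surface(x, y):
    """Required surface of Equation 20.1, used here to set the weights."""
    return 1.1 * math.exp(-0.07 * (x - 5) ** 2 - 0.07 * (y - 5) ** 2) - 0.9


def triangular(x, left, peak, right):
    """Triangular membership function with its maximum of 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


CENTERS = [0.0, 2.5, 5.0, 7.5, 10.0]   # assumed placement, five per input
SPACING = 2.5


def neuro_fuzzy(x, y):
    # Fuzzification block: analog inputs to fuzzy variables.
    mu_x = [triangular(x, c - SPACING, c, c + SPACING) for c in CENTERS]
    mu_y = [triangular(y, c - SPACING, c, c + SPACING) for c in CENTERS]
    upper = 0.0
    lower = 0.0
    for cx, mx in zip(CENTERS, mu_x):
        for cy, my in zip(CENTERS, mu_y):
            product = mx * my                     # multiplication block
            upper += surface(cx, cy) * product    # upper sum, expected values
            lower += product                      # lower sum, all weights 1
    return upper / lower                          # division block


z_center = neuro_fuzzy(5.0, 5.0)
```

At a membership function center only one product is nonzero, so the output equals the stored weight; between centers the products interpolate the neighboring weights.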
FIGURE 20.6 Control surface using neuro-fuzzy system in Figure 20.5, Mamdani rule for weight initialization of 
the upper sum unit and five membership functions per input: (a) trapezoidal membership function, error = 3.8723 
and (b) triangular membership function, error = 2.4468. 

FIGURE 20.7 Control surface using neuro-fuzzy system in Figure 20.5, TSK rule for weight initialization of the 
upper sum unit and five membership functions per input: (a) trapezoidal membership function, error = 2.4423 and 
(b) triangular membership function, error = 1.3883. 


20.3.2 Structure Two 


A single neuron can divide the input space by a line, plane, or hyperplane, depending on the problem 
dimensionality. In order to select just one region in n-dimensional input space, more than (n+1) neurons are 
required. For example, to separate a rectangular pattern, four neurons are required, as shown in 
Figure 20.8. If more input clusters should be selected, then the number of neurons in the hidden layer 
should be multiplied accordingly. If the number of neurons in the hidden layer is not limited, then all 
classification problems can be solved using the three-layer network. 

With the concept shown in Figure 20.8, fuzzifiers and MIN operators used for region selection can be 
replaced by a simple neural network architecture [XYW10]. In this example, the two analog inputs, each 
with five membership functions, can be organized as a two-dimensional input space divided by six 
neurons horizontally (from line a to line f) and by six neurons vertically (from line g to line l), as shown 
in Figure 20.9. The corresponding neural network is shown in Figure 20.10. Neurons in the first layer 
correspond to the lines indexed from a to l. Each neuron is connected to only one input. For each 
neuron input, the weight is equal to +1 and the threshold is equal to the value of the crossing point on the 
x or y axis. The type of activation function of the neurons in the first layer determines the type of 
membership function of the fuzzy system, as shown in Figure 20.11. 
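
One way to see how an activation function can shape a membership function is the one-dimensional sketch below. It is a simplified illustration under assumptions, not the exact construction of Figure 20.11: a saturating linear (unipolar) activation is assumed, and the membership value is formed as the output of a lower boundary neuron minus the output of an upper boundary neuron. All names and numerical values are hypothetical.

```python
def saturating_ramp(net, slope=1.0):
    """Unipolar piecewise-linear activation: 0 below 0, saturating at 1."""
    return min(1.0, max(0.0, slope * net))


def boundary_neuron(x, threshold, slope=1.0):
    """First-layer neuron: input weight +1, threshold at the crossing point."""
    return saturating_ramp(x - threshold, slope)


def membership_from_neurons(x, lower_threshold, upper_threshold, slope=1.0):
    """Lower boundary neuron (weight +1) minus upper boundary neuron (weight -1)."""
    return (boundary_neuron(x, lower_threshold, slope)
            - boundary_neuron(x, upper_threshold, slope))


# A steep ramp relative to the region width gives a trapezoidal shape.
trapezoid_value = membership_from_neurons(4.5, 2.0, 6.0, slope=1.0)

# A ramp as wide as the region gives a triangular shape peaking at x = 4.
triangle_peak = membership_from_neurons(4.0, 2.0, 4.0, slope=0.5)
```

With a slope of 1 and thresholds at 2 and 6, the value rises from 2 to 3, stays at 1 until 6, and falls back to 0 at 7; with a slope of 0.5 and thresholds at 2 and 4, the plateau shrinks to a single point and the shape becomes triangular.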
FIGURE 20.8 Separation of the rectangular area on a two-dimensional input space and desired neural network 
to fulfill this task. 


FIGURE 20.9 Two-dimensional input plane separated vertically and horizontally by six neurons in each direction. 


Neurons in the second layer correspond to the sections indexed from 1 to 25. Each of them has 
two connections to lower boundary neurons with weights of +1 and two connections to upper boundary 
neurons with weights of −1. Thresholds for all these neurons in the second layer are set to 3. 

Weights of the upper sum unit in the third layer have values corresponding to the specified values 
in the selected areas. The specified values can be obtained from either the fuzzy table (by the Mamdani 
rule) or the expected function values (by the TSK rule). Weights of the lower sum unit are equal to “1.” 

FIGURE 20.10 The neural network performing the function of fuzzy system. 

FIGURE 20.11 Construction of membership functions by neurons’ activation functions: (a) trapezoidal membership 
function and (b) triangular membership function. 


All neurons in Figure 20.8 have a unipolar activation function, and if the system is properly designed, 
then for any input vector in a certain area only the neuron of this area produces +1 while all remaining 
neurons have zero values. When the input vector is close to a boundary between 
two or more regions, all participating neurons produce fractional values and the system 
output is generated as a weighted sum. The fourth layer performs this calculation: the upper sum 
divided by the lower sum. Like the neuro-fuzzy system in Figure 20.5, the last two layers are used for 
defuzzification. 

FIGURE 20.12 Control surface using neuro-fuzzy system in Figure 20.10: (a) using combination of activation 
functions in Figure 20.11a, error = 2.4423 and (b) using combination of activation functions in Figure 20.11b, 
error = 1.3883. 

Using this concept of the neuro-fuzzy system, the resulting surfaces with different combinations of 
activation functions can be obtained, as shown in Figure 20.12. 

It was shown above that the simple neural network of Figure 20.10 can replace the TSK neuro-fuzzy 
system in Figure 20.5. All parameters of this network are directly derived from the requirements specified 
for a fuzzy system, and there is no need for a training process. 


20.4 Conclusion 


The chapter introduced two types of neuro-fuzzy architectures, in order to improve the performance of 
classic fuzzy systems. Based on a given example, the classic fuzzy systems and the neuro-fuzzy systems, 
with different settings, are compared. From the comparison results, the following conclusions can be 
drawn: 


• In the same type of fuzzy system, using triangular membership functions gives better results 
than using the same number of trapezoidal membership functions. 

• With the same membership function, the TSK (Takagi, Sugeno, and Kang) fuzzy systems perform 
more accurate calculation than the Mamdani fuzzy system. 

• The neuro-fuzzy system in Figure 20.5 makes a slight improvement in accuracy, at the cost 
of using signal multiplication units, which are difficult to implement in hardware. 

• The neuro-fuzzy system in Figure 20.10 does the same job as the neuro-fuzzy system with the TSK 
rule in Figure 20.5. 

21.1 Introduction 


Fuzzy control is regarded as the most widely used application of fuzzy logic [Mendel 2001]. A fuzzy logic 
controller (FLC) is credited with being an adequate methodology for designing robust controllers that 
are able to deliver a satisfactory performance in the face of uncertainty and imprecision. In addition, 
FLCs provide a way of constructing controller algorithms by means of linguistic labels and linguistically 
interpretable rules in a user-friendly way closer to human thinking and perception. 

FLCs have successfully outperformed traditional control systems (like PID controllers) and have 
given a satisfactory performance similar to (or even better than) that of human operators. According to Mamdani 
[Mamdani 1994]: “When tuned, the parameters of a PID controller affect the shape of the entire con- 
trol surface. Because fuzzy logic control is a rule-based controller, the shape of the control surface can 
be individually manipulated for the different regions of the state space, thus limiting possible effects to 
neighboring regions only.” FLCs have been applied with great success to many applications. The 
first FLC was developed in 1974 by Mamdani and Assilian for controlling a steam generator [Mamdani 
1975]. In 1976, Blue Circle Cement and SIRA in Denmark developed a cement kiln controller, which is 
the first industrial application of fuzzy logic. The system went into operation in 1982 [Holmblad 1982]. In 
the 1980s, several important industrial applications of fuzzy logic were launched successfully in Japan, 
where Hitachi put a fuzzy logic based automatic train control system into operation in Sendai city’s sub- 
way system in 1987 [Yasunobu 1985]. Another early successful industrial application of fuzzy logic is a 
water-treatment system developed by Fuji Electric [Yen 1999]. These and other applications motivated 
many Japanese engineers to investigate a wide range of novel fuzzy logic applications. This led to the fuzzy 
boom in Japan, which was a result of close collaboration and technology transfer between universities 
and industries, where large-scale national research initiatives (like the Laboratory for International 
Fuzzy Engineering Research [LIFE]) were established by Japanese government agencies [Yen 1999]. In 
late January 1990, Matsushita Electric Industrial Co. introduced their newly developed fuzzy controlled 
automatic washing machine and launched a major commercial campaign for the “fuzzy” product. This 
campaign turned out to be a successful marketing effort not only for the product, but also for 
fuzzy logic technology [Yen 1999]. Many other home electronics companies followed Matsushita’s approach 
and introduced fuzzy vacuum cleaners, fuzzy rice cookers, fuzzy refrigerators, fuzzy camcorders, and 
other products. As a result, consumers in Japan recognized the Japanese word “fuzzy,” which won the 
gold prize for the new word in 1990 [Hirota 1995]. This fuzzy boom in Japan triggered a broad and serious 
interest in this technology in Korea, Europe, and the United States. Boeing, NASA, United Technologies, 
and other aerospace companies have developed FLCs for space and aviation applications [Munakata 
1994]. Other control applications include control of alternating current induction motors, engine spark 
advance control, control of autonomous robots, and many other applications [Yen 1999]. Following this, 
the recent years have witnessed a wide-scale deployment of FLCs to numerous successful applications 
[Langari 1995, Yen 1999]. 

However, there are many sources of uncertainty facing the FLC in dynamic real-world unstructured 
environments and many real-world applications; some of the uncertainty sources are as follows: 


• Uncertainties in inputs to the FLC, which translate into uncertainties in the antecedents’ 
membership functions (MFs), as the sensor measurements are affected by high noise levels from 
various sources. In addition, the input sensors can be affected by the conditions of observation 
(i.e., their characteristics can be changed by environmental conditions such as wind, 
sunshine, humidity, rain, etc.). 

• Uncertainties in control outputs, which translate into uncertainties in the consequents’ MFs of 
the FLC. Such uncertainties can result from the change of the actuators’ characteristics, which 
can be due to wear, tear, environmental changes, etc. 

• Linguistic uncertainties, as the meaning of words that are used in the antecedents’ and 
consequents’ linguistic labels can be uncertain; words mean different things to different people 
[Mendel 2001]. In addition, experts do not always agree and they often provide different 
consequents for the same antecedents. A survey of experts will usually lead to a histogram of 
possibilities for the consequent of a rule; this histogram represents the uncertainty about the consequent 
of a rule [Mendel 2001]. 

• Uncertainties associated with the change in the operation conditions of the controller. Such 
uncertainties can translate into uncertainties in the antecedents’ and/or consequents’ MFs. 

• Uncertainties associated with the use of noisy training data that could be used to learn, tune, or 
optimize the FLC. 


All of these uncertainties translate into uncertainties about fuzzy-set MFs [Mendel 2001]. The vast 
majority of the FLCs that have been used to date were based on the traditional type-1 FLCs. However, 
type-1 FLCs cannot fully handle or accommodate the linguistic and numerical uncertainties associated 
with dynamic unstructured environments, as they use type-1 fuzzy sets. Type-1 fuzzy sets handle the 
uncertainties associated with the FLC inputs and outputs by using precise and crisp MFs that the user 
believes capture the uncertainties. Once the type-1 MFs have been chosen, all the uncertainty disap- 
pears because type-1 MFs are precise [Mendel 2001]. The linguistic and numerical uncertainties associ- 
ated with dynamic unstructured environments cause problems in determining the exact and precise 
antecedents’ and consequents’ MFs during the FLC design. Moreover, the designed type-1 fuzzy sets can 
be suboptimal under specific environment and operation conditions; however, because of the environ- 
ment changes and the associated uncertainties, the chosen type-1 fuzzy sets might not be appropriate 
anymore. This can cause degradation in the FLC performance, which can result in poor control and 
inefficiency, and we might end up wasting time in frequently redesigning or tuning the type-1 FLC so 
that it can deal with the various uncertainties. 

A type-2 fuzzy set is characterized by a fuzzy MF, i.e., the membership value (or membership grade) 
for each element of this set is a fuzzy set in [0,1], unlike a type-1 fuzzy set where the membership grade 
is a crisp number in [0,1] [1]. The MFs of type-2 fuzzy sets are three dimensional (3D) and include a 
footprint of uncertainty. It is the new third dimension of type-2 fuzzy sets and the footprint of uncer- 
tainty that provide additional degrees of freedom that make it possible to directly model and handle 
uncertainties [Mendel 2001]. The type-2 fuzzy sets are useful where it is difficult to determine the exact 
and precise MFs. Type-2 FLCs that use type-2 fuzzy sets have been used to date with great success where 
the type-2 FLCs have outperformed their type-1 counterparts in several applications where there is high 
level of uncertainty [Hagras 2007b]. 

In the next section, we will introduce the type-2 fuzzy sets and their associated terminologies. Section 
21.3 introduces briefly the interval type-2 FLC and its various components. Section 21.4 provides a 
practical example to clarify the various operations of the type-2 FLC. Finally, conclusions and future 
directions are presented in Section 21.4. 


21.2 Type-2 Fuzzy Sets 


Type-1 FLCs employ crisp and precise type-1 fuzzy sets. For example, consider a type-1 fuzzy set repre- 
senting the linguistic label of “Low” temperature in Figure 21.1a: if the input temperature x is 15°C, then 
the membership of this input to the “Low” set will be the certain and crisp membership value of 0.4. 
However, the center and endpoints of this type-1 fuzzy set can vary due to uncertainties (which could 
arise, e.g., from noise) in the measurement of temperature (numerical uncertainty) and in the situa- 
tions in which 15°C could be called low (linguistic uncertainty) (in the Arctic 15°C might be considered 
“High,” while in the Caribbean it would be considered “Low”). If this linguistic label was employed with 
a fuzzy logic controller, then the type-1 FLC would need to be frequently tuned to handle such uncer- 
tainties. Alternatively, one would need to have a group of separate type-1 sets and type-1 FLCs where 
each FLC will handle a certain situation. 

On the other hand, a type-2 fuzzy set is characterized by a fuzzy membership function (MF), i.e., the 
membership value (or membership grade) for each element of this set is itself a fuzzy set in [0,1]. For 
example, if the linguistic label of “Low” temperature is represented by a type-2 fuzzy set as shown in 
Figure 21.1b, then the input x of 15°C will no longer have a single value for the MF. Instead, the MF takes 
on values wherever the vertical line intersects the area shaded in gray. Hence, 15°C will have primary 
membership values that lie in the interval [0.2, 0.6]. Each point of this interval will have also a weight 
associated with it. Consequently, this will create an amplitude distribution in the third dimension to 
form what is called a secondary MF, which can be a triangle as shown in Figure 21.1c. In case the 
secondary MF is equal to 1 for all the points in the primary membership and if this is true for ∀x ∈ X, we have 
the case of an interval type-2 fuzzy set. The input x of 15°C will now have a primary membership and 
an associated secondary MF. Repeating this for all x ∈ X creates a 3D MF (as shown in Figure 21.1d), a 
type-2 MF, that characterizes a type-2 fuzzy set. The MFs of type-2 fuzzy sets are 3D and include a 
footprint of uncertainty (FOU) (shaded in gray in Figure 21.1b). It is the new third dimension of type-2 
fuzzy sets and the FOU that provide additional degrees of freedom and that make it possible to directly 
model and handle the numerical uncertainties and linguistic uncertainties. 

FIGURE 21.1 (a) A type-1 fuzzy set. (b) A type-2 fuzzy set: primary MF. (c) An interval type-2 fuzzy set 
secondary MF (drawn with dotted lines) and a general type-2 MF (solid line) at a specific point x′. (d) 3D view of a general 
type-2 fuzzy set. 
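
The “Low” temperature example can be sketched for the interval type-2 case as follows. The shapes and breakpoints of the lower and upper bounds of the FOU are hypothetical; they were chosen here only so that an input of 15°C yields the primary membership interval [0.2, 0.6] mentioned in the text.

```python
def falling_ramp(x, start, end):
    """Shoulder-type MF: 1 up to `start`, 0 from `end`, linear in between."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)


def low_upper(x):
    """Upper bound of the FOU of 'Low' (assumed breakpoints, in °C)."""
    return falling_ramp(x, 5.0, 30.0)


def low_lower(x):
    """Lower bound of the FOU of 'Low' (assumed breakpoints, in °C)."""
    return falling_ramp(x, -5.0, 20.0)


def primary_membership(x):
    """Interval of primary membership values of x, i.e., a slice of the FOU."""
    return (low_lower(x), low_upper(x))


def interval_secondary_grade(x, u):
    """Interval type-2 set: secondary grade 1 inside the primary membership."""
    lower, upper = primary_membership(x)
    return 1.0 if lower <= u <= upper else 0.0


interval_at_15 = primary_membership(15.0)
```

A general type-2 set would replace `interval_secondary_grade` with a non-uniform function of u, for example a triangle over the same interval, as in Figure 21.1c.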


21.2.1 Type-2 Fuzzy Set Terminologies and Operations 


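A type-2 fuzzy set $\tilde{A}$ is characterized by a type-2 MF $\mu_{\tilde{A}}(x, u)$ [Mendel 2001], where $x \in X$ and 
$u \in J_x \subseteq [0, 1]$, i.e., 

$$\tilde{A} = \{((x, u), \mu_{\tilde{A}}(x, u)) \mid \forall x \in X, \ \forall u \in J_x \subseteq [0, 1]\} \tag{21.1}$$

in which $0 \le \mu_{\tilde{A}}(x, u) \le 1$. $\tilde{A}$ can also be expressed as follows [Mendel 2001]: 

$$\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu_{\tilde{A}}(x, u) / (x, u), \quad J_x \subseteq [0, 1] \tag{21.2}$$

where $\int\int$ denotes union over all admissible x and u. For discrete universes of discourse, $\int$ is replaced by $\sum$ 
[Mendel 2001]. 

At each value of x, say $x = x'$, the two-dimensional plane whose axes are u and $\mu_{\tilde{A}}(x', u)$ is called a 
vertical slice of $\mu_{\tilde{A}}(x, u)$. A secondary MF is a vertical slice of $\mu_{\tilde{A}}(x, u)$. It is $\mu_{\tilde{A}}(x = x', u)$ for $x' \in X$ and 
$\forall u \in J_{x'} \subseteq [0, 1]$ [Mendel 2001], i.e., 

$$\mu_{\tilde{A}}(x = x', u) \equiv \mu_{\tilde{A}}(x') = \int_{u \in J_{x'}} f_{x'}(u) / u, \quad J_{x'} \subseteq [0, 1] \tag{21.3}$$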

in which $0 \le f_{x'}(u) \le 1$. Because $\forall x' \in X$, the prime notation on $\mu_{\tilde{A}}(x')$ is dropped and we refer to $\mu_{\tilde{A}}(x)$ as 
a secondary MF [Mendel 2002a]; it is a type-1 fuzzy set which is also referred to as a secondary set. Many 
choices are possible for the secondary MFs. According to Mendel [Mendel 2001], the name that we use 
to describe the entire type-2 MF is associated with the name of the secondary MFs; so, for example, if 
the secondary MF is triangular, then we refer to $\mu_{\tilde{A}}(x, u)$ as a triangular type-2 MF. Figure 21.1c shows a 
triangular secondary MF at x′, which is drawn using the thick line. Based on the concept of secondary 
sets, type-2 fuzzy sets can be written as the union of all secondary sets [Mendel 2001]. 

The domain of a secondary MF is called the primary membership of x [Mendel 2001]. In Equation 21.1, 
$J_x$ is the primary membership of x, where $J_x \subseteq [0, 1]$ for $\forall x \in X$ [Mendel 2001]. 

When $f_x(u) = 1$, $\forall u \in J_x \subseteq [0, 1]$, then the secondary MFs are interval sets, and, if this is true for 
$\forall x \in X$, we have the case of an interval type-2 MF [Mendel 2001]. Interval secondary MFs reflect a 
uniform uncertainty at the primary memberships of x. Figure 21.1c shows the secondary membership at x′ 
(drawn in dotted lines) in the case of interval type-2 fuzzy sets. 


21.2.1.1 Footprint of Uncertainty 


The uncertainty in the primary memberships of a type-2 fuzzy set $\tilde{A}$ consists of a bounded region that
is called the footprint of uncertainty (FOU) [Mendel 2002a]. It is the union of all primary memberships
[Mendel 2002a], i.e.,


$$FOU(\tilde{A}) = \bigcup_{x \in X} J_x \tag{21.4}$$


The shaded region in Figure 21.1b is the FOU. It is very useful, because according to Mendel and John
[Mendel 2002a], it not only focuses our attention on the uncertainties inherent in a specific type-2 MF,
whose shape is a direct consequence of the nature of these uncertainties, but it also provides a very con- 
venient verbal description of the entire domain of support for all the secondary grades of a type-2 MF. 
The shaded FOU implies that there is a distribution that sits on top of it—the new third dimension of 
type-2 fuzzy sets. What that distribution looks like depends on the specific choice made for the second- 
ary grades. When they all equal one, the resulting type-2 fuzzy sets are called interval type-2 fuzzy sets. 
Establishing an appropriate FOU is analogous to establishing a probability density function (pdf) in a 
probabilistic uncertainty situation [Mendel 2001]. The larger the FOU, the more uncertainty there is. 
When the FOU collapses to a curve, then its associated type-2 fuzzy set collapses to a type-1 fuzzy set, 
in much the same way that a pdf collapses to a point when randomness disappears. Recently, it has been 
shown that regardless of the choice of the primary MF (triangle, Gaussian, trapezoid), the resulting FOU 
is about the same [Mendel 2002b]. According to Mendel and Wu [Mendel 2002b], the FOU of a type-2
MF also handles the rich variety of choices that can be made for a type-1 MF, i.e., by using type-2 fuzzy
sets instead of type-1 fuzzy sets, the issue of which type-1 MF to choose diminishes in importance.


21.2.1.2 Embedded Fuzzy Sets 


For continuous universes of discourse X and U, an embedded type-2 set $\tilde{A}_e$ is defined as follows
[Mendel 2001]:


$$\tilde{A}_e = \int_{x \in X} [f_x(u)/u]/x, \quad u \in J_x \subseteq U = [0,1] \tag{21.5}$$


Set $\tilde{A}_e$ is embedded in $\tilde{A}$ and there is an uncountable number of embedded type-2 sets in $\tilde{A}$ [Mendel
2002b]. For discrete universes of discourse X and U, an embedded type-2 set $\tilde{A}_e$ has N elements, where $\tilde{A}_e$
contains exactly one element from $J_{x_1}, J_{x_2}, \ldots, J_{x_N}$, namely $u_1, u_2, \ldots, u_N$, each with its associated
secondary grade $f_{x_1}(u_1), f_{x_2}(u_2), \ldots, f_{x_N}(u_N)$ [Mendel 2001], i.e.,


$$\tilde{A}_e = \sum_{d=1}^{N} [f_{x_d}(u_d)/u_d]/x_d, \quad u_d \in J_{x_d} \subseteq U = [0,1] \tag{21.6}$$


Set $\tilde{A}_e$ is embedded in $\tilde{A}$ and there is a total of $\prod_{d=1}^{N} M_d$ sets $\tilde{A}_e$ [23], where $M_d$ is the
number of discretization levels of $u_d$ at each $x_d$.

For continuous universes of discourse X and U, an embedded type-1 set $A_e$ is defined as follows
[Mendel 2002a]:


$$A_e = \int_{x \in X} u/x, \quad u \in J_x \subseteq U = [0,1] \tag{21.7}$$


Set $A_e$ is the union of all the primary memberships of set $\tilde{A}_e$ in Equation 21.5 and there is an uncountable
number of $A_e$.

For discrete universes of discourse X and U, an embedded type-1 set $A_e$ has N elements, one each from
$J_{x_1}, J_{x_2}, \ldots, J_{x_N}$, namely $u_1, u_2, \ldots, u_N$ [Mendel 2002b], i.e.,


$$A_e = \sum_{d=1}^{N} u_d/x_d, \quad u_d \in J_{x_d} \subseteq U = [0,1] \tag{21.8}$$


There is a total of $\prod_{d=1}^{N} M_d$ sets $A_e$ [Mendel 2002a].
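
As a small illustration of the counting argument above, the following Python sketch enumerates the embedded type-1 sets of a tiny discretized type-2 set and checks that their number equals the product of the discretization levels $M_d$. The primary memberships and the helper name `embedded_type1_sets` are made up for illustration; they are assumptions and do not come from the cited references.

```python
from itertools import product
from math import prod

# Hypothetical discretized primary memberships J_{x_d} for N = 3 points.
# Keys are the points x_d; values are the admissible primary grades u at x_d.
primary_memberships = {
    1.0: [0.2, 0.4, 0.6],        # M_1 = 3
    2.0: [0.5, 0.7],             # M_2 = 2
    3.0: [0.1, 0.3, 0.5, 0.7],   # M_3 = 4
}

def embedded_type1_sets(J):
    """Yield every embedded type-1 set: exactly one grade u_d picked at each x_d."""
    xs = list(J)
    for grades in product(*(J[x] for x in xs)):
        yield list(zip(xs, grades))

all_sets = list(embedded_type1_sets(primary_memberships))
expected_count = prod(len(grades) for grades in primary_memberships.values())
print(len(all_sets), expected_count)
```

Even with this coarse discretization the count grows multiplicatively with N, which is why enumerating embedded sets is only practical for very small examples.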
FIGURE 21.2 (a) Three type-1 fuzzy sets representing an input to the FLC. (b) The three type-1 fuzzy sets in
Figure 21.2a are embedded in the Low type-2 fuzzy set.


It has been proven by Mendel and John [Mendel 2002a] that a type-2 fuzzy set $\tilde{A}$ can be represented as the
union of its type-2 embedded sets, i.e.,


$$\tilde{A} = \sum_{l=1}^{n''} \tilde{A}_e^l, \quad \text{where } n'' = \prod_{d=1}^{N} M_d \tag{21.9}$$


Figure 21.2a shows three type-1 fuzzy sets (Very Very Low, Very Low, and Low) used to express in detail the
different fuzzy levels of Low for an input to the FLC. In Figure 21.2b, notice that the type-1 fuzzy sets for Very
Very Low, Very Low, and Low are embedded in the interval type-2 fuzzy set Low. Moreover, there is
a large number of other embedded type-1 fuzzy sets (uncountable for continuous universes of discourse).


21.2.1.3 Interval Type-2 Fuzzy Sets 


In Equation 21.3, when $f_x(u) = 1$, $\forall u \in J_x \subseteq [0,1]$, then the secondary MFs are interval sets, and, if this is
true for $\forall x \in X$, we have the case of an interval type-2 MF, which characterizes the interval type-2 fuzzy
sets. Interval secondary MFs reflect a uniform uncertainty at the primary memberships of x. Interval
type-2 sets are very useful when we have no other knowledge about secondary memberships [Liang
2000]. The membership grades of the interval type-2 fuzzy sets are called “interval type-1 fuzzy sets.”
Since all the memberships in an interval type-1 set are unity, in the sequel, an interval type-1 set is
represented just by its domain interval, which can be represented by its left and right end-points as $[l, r]$
[Liang 2000]. The two end-points are associated with two type-1 MFs that are referred to as the lower and
upper MFs ($\underline{\mu}_{\tilde{A}}(x)$, $\overline{\mu}_{\tilde{A}}(x)$) [Liang 2000].

The upper and lower MFs are two type-1 MFs which are bounds for the footprint of uncertainty
FOU($\tilde{A}$) of a type-2 fuzzy set $\tilde{A}$.
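
To make the role of the upper and lower MFs concrete, the sketch below evaluates both bounds for a Gaussian primary MF whose mean is uncertain within an interval. This construction is assumed here purely for illustration, since the text does not prescribe a particular primary MF, and the function name `it2_gaussian` and all parameter values are hypothetical.

```python
import math

def gaussian(x, mean, sigma):
    """Ordinary type-1 Gaussian membership grade."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def it2_gaussian(x, m1, m2, sigma):
    """Return (lower, upper) membership of x for a Gaussian primary MF
    with fixed sigma and a mean that is uncertain within [m1, m2]."""
    if x < m1:
        upper = gaussian(x, m1, sigma)
    elif x > m2:
        upper = gaussian(x, m2, sigma)
    else:
        upper = 1.0  # some admissible mean coincides with x
    # The lower bound is the smaller of the two extreme Gaussians.
    lower = min(gaussian(x, m1, sigma), gaussian(x, m2, sigma))
    return lower, upper

# The interval [lower, upper] is the primary membership J_x at x = 2.0.
lower, upper = it2_gaussian(x=2.0, m1=3.0, m2=4.0, sigma=1.0)
print(f"J_x = [{lower:.4f}, {upper:.4f}]")

# With m1 == m2 the FOU collapses and the set behaves like a type-1 set.
collapsed = it2_gaussian(x=2.0, m1=3.0, m2=3.0, sigma=1.0)
```

Sweeping x over the universe of discourse and plotting the two returned values traces the boundary of the FOU.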

According to Mendel [Mendel 2001], we can re-express Equation 21.3 to represent the
interval type-2 fuzzy set $\tilde{A}$ in terms of upper and lower MFs as follows:


$$\tilde{A} = \int_{x \in X} \left[ \int_{u \in [\underline{\mu}_{\tilde{A}}(x),\, \overline{\mu}_{\tilde{A}}(x)]} 1/u \right] / x \tag{21.10}$$


For type-2 fuzzy sets, there are new operators named the meet and join to account for the intersection
and union, respectively. Liang and Mendel [Liang 2000] derived the expressions for the meet and join
in interval type-2 fuzzy sets, in which we need to compute the join and meet of secondary MFs which are
type-1 interval fuzzy sets.
Let $F = \int_{v \in F} 1/v$ be an interval type-1 set with domain $[l_f, r_f] \subseteq [0,1]$ and $G = \int_{w \in G} 1/w$ be another
interval type-1 set with domain $[l_g, r_g] \subseteq [0,1]$.
The meet Q between F and G under the product t-norm, which is used in our type-2 FLC, is written as
follows [Liang 2000]:


$$Q = F \sqcap G = \int_{q \in [l_f l_g,\, r_f r_g]} 1/q \tag{21.11}$$


From Equation 21.11, each term in $F \sqcap G$ is equal to the product $v \cdot w$ for some $v \in F$ and $w \in G$, in which
the smallest term is $l_f l_g$ and the largest is $r_f r_g$. Since both F and G have continuous domains, $F \sqcap G$
has a continuous domain; therefore, $F \sqcap G$ is an interval type-1 set with domain $[l_f l_g, r_f r_g]$ [Liang 2000].
In a similar manner, the meet under product t-norm of n interval type-1 sets $F_1, \ldots, F_n$ having domains
$[l_1, r_1], \ldots, [l_n, r_n]$, respectively, is an interval set with domain
$\left[ \prod_{\alpha=1}^{n} l_\alpha, \prod_{\alpha=1}^{n} r_\alpha \right]$. The meet
under minimum t-norm is calculated in a similar manner [Liang 2000].
The join between F and G is given by


$$Q = F \sqcup G = \int_{q \in [l_f \vee l_g,\, r_f \vee r_g]} 1/q \tag{21.12}$$


where $q = v \vee w$, and $\vee$ denotes the maximum operation used in our type-2 FLC. The join of n interval
type-1 sets $F_1, \ldots, F_n$ having domains $[l_1, r_1], \ldots, [l_n, r_n]$, respectively, is an interval set with domain
$[(l_1 \vee l_2 \vee \ldots \vee l_n), (r_1 \vee r_2 \vee \ldots \vee r_n)]$, i.e., with domain equal to
$[\max(l_1, l_2, \ldots, l_n), \max(r_1, r_2, \ldots, r_n)]$ [18].
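
Because an interval type-1 set is fully described by its two end points, the meet and join reduce to end-point arithmetic. The following Python sketch shows this under the assumption that each set is stored as a `(left, right)` tuple; the function names and the sample intervals are illustrative choices, not part of the cited derivation.

```python
from math import prod

def meet_product(intervals):
    """Meet under the product t-norm: multiply left ends and right ends."""
    return (prod(l for l, _ in intervals), prod(r for _, r in intervals))

def meet_minimum(intervals):
    """Meet under the minimum t-norm: take the minimum of each end point."""
    return (min(l for l, _ in intervals), min(r for _, r in intervals))

def join_maximum(intervals):
    """Join under the maximum t-conorm: take the maximum of each end point."""
    return (max(l for l, _ in intervals), max(r for _, r in intervals))

# Two illustrative interval type-1 sets with domains inside [0, 1].
F = (0.35, 0.85)
G = (0.10, 0.70)

q_meet = meet_product([F, G])   # domain [l_f * l_g, r_f * r_g]
q_min = meet_minimum([F, G])    # domain [min(l_f, l_g), min(r_f, r_g)]
q_join = join_maximum([F, G])   # domain [max(l_f, l_g), max(r_f, r_g)]
print(q_meet, q_min, q_join)
```

All three functions accept any number of intervals, which mirrors the n-set forms stated above.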

After reviewing the definition of the type-2 fuzzy sets and their associated terminologies, we can see
that using type-2 fuzzy sets to represent the inputs and outputs of a FLC has many advantages when
compared to the type-1 fuzzy sets. We summarize some of these advantages as follows:


• As the MFs of type-2 fuzzy sets are fuzzy and contain a FOU, they can model and handle the
linguistic and numerical uncertainties associated with the inputs and outputs of the FLC. Therefore,
FLCs that are based on type-2 fuzzy sets have the potential to produce a better performance
than the type-1 FLCs when dealing with uncertainties [Hagras 2004].

• Using type-2 fuzzy sets to represent the FLC inputs and outputs will result in the reduction of the
FLC rule base when compared to using type-1 fuzzy sets, as the uncertainty represented in the
FOU of the type-2 fuzzy sets lets us cover the same range as type-1 fuzzy sets with a smaller
number of labels, and the rule reduction will be greater when the number of the FLC inputs increases
[Mendel 2001, Hagras 2004].

• Each input and output will be represented by a large number of type-1 fuzzy sets, which are
embedded in the type-2 fuzzy sets [Mendel 2001, Hagras 2004]. The use of such a large number of
type-1 fuzzy sets to describe the input and output variables allows for a detailed description of the
analytical control surface, as the addition of the extra levels of classification gives a much smoother
control surface and response. In addition, the type-2 FLC can be thought of as a collection of
many different embedded type-1 FLCs [Mendel 2001].

• It has been shown in [Wu 2005] that the extra degrees of freedom provided by the FOU enable
a type-2 FLC to produce outputs that cannot be achieved by type-1 FLCs with the same number
of MFs. It has been shown that a type-2 fuzzy set may give rise to an equivalent type-1 membership
grade that is negative or larger than unity. Thus, a type-2 FLC is able to model more complex
input-output relationships than its type-1 counterpart and, thus, can give a better control
response.
FIGURE 21.3 (a) Control surface of a robot type-2 FLC with 4 rules. (b) Control surface of a robot type-1 FLC with
4 rules. (c) Control surface of a robot type-1 FLC with 9 rules. (d) Control surface of a robot type-1 FLC with 25 rules.


The above points are illustrated in Figure 21.3, which shows, for an outdoor mobile robot, how a type-2
FLC with a rule base of only four rules can produce a smoother control surface (Figure 21.3a), and hence
a better result, than its type-1 counterparts that used rule bases of 4, 9, and 25 rules (Figure 21.3b
through d, respectively) [Hagras 2004]. It is also shown that as the type-1 FLC
rule base increases, its response approaches that of the type-2 FLC, which encompasses a huge number
of embedded type-1 FLCs.


21.3 Interval Type-2 FLC 


The interval type-2 FLC uses interval type-2 fuzzy sets to represent the inputs and/or outputs of the
FLC. The interval type-2 FLC is a special case of the general type-2 FLC. The vast majority of type-2 FLC
applications to date employ the interval type-2 FLC. This is because the general type-2 FLC is
computationally intensive, whereas the computation simplifies considerably with the interval type-2 FLC,
which enables us to design a FLC that operates in real time.
FIGURE 21.4 Type-2 FLC.


The interval type-2 FLC is depicted in Figure 21.4 and it consists of a Fuzzifier, Inference Engine, Rule
Base, Type-reducer, and Defuzzifier. The interval type-2 FLC operates as follows: the crisp inputs from
the input sensors are first fuzzified into input type-2 fuzzy sets; singleton fuzzification is usually used 
in interval type-2 FLC applications due to its simplicity and suitability for embedded processors and 
real-time applications. The input type-2 fuzzy sets then activate the inference engine and the rule base 
to produce output type-2 fuzzy sets. The type-2 FLC rules will remain the same as in the type-1 FLC, but 
the antecedents and/or the consequents will be represented by interval type-2 fuzzy sets. The inference 
engine combines the fired rules and gives a mapping from input type-2 fuzzy sets to output type-2 fuzzy 
sets. The type-2 fuzzy outputs of the inference engine are then processed by the type-reducer, which 
combines the output sets and performs a centroid calculation that leads to type-1 fuzzy sets called the 
type-reduced sets. The type-reduction process uses the iterative Karnik-Mendel (KM) procedure to
calculate the type-reduced fuzzy sets [Mendel 2001]. The convergence time of the KM procedure is
proportional to the number of fired rules. After the type-reduction process, the type-reduced sets are then defuzzified (by
taking the average of the type-reduced set) to obtain crisp outputs that are sent to the actuators. More 
information about the interval type-2 FLC can be found in [Hagras 2004]. The following sections will 
give a brief overview of the interval type-2 FLC. 


21.3.1 Fuzzifier 


The fuzzifier maps a crisp input vector with p inputs $x = (x_1, \ldots, x_p)^T \in X_1 \times X_2 \times \ldots \times X_p \equiv X$ into input
fuzzy sets; these fuzzy sets can, in general, be type-2 fuzzy input sets $\tilde{A}_x$ [13], [17]. However, we will use
singleton fuzzification as it is fast to compute and thus suitable for the robot real-time operation. In singleton
fuzzification, the input fuzzy set has only a single point of nonzero membership, i.e., $\tilde{A}_x$ is a type-2 fuzzy
singleton if $\mu_{\tilde{A}_x}(x) = 1/1$ for $x = x'$ and $\mu_{\tilde{A}_x}(x) = 1/0$ for all other $x \ne x'$ [Mendel 2001].


21.3.2 Rule Base 


The rules will remain the same as in the type-1 FLC but the antecedents and the consequents will be
represented by interval type-2 fuzzy sets. Consider an interval type-2 FLC having p inputs
$x_1 \in X_1, \ldots, x_p \in X_p$ and c outputs $y_1 \in Y_1, \ldots, y_c \in Y_c$. The ith rule in this multi-input
multi-output (MIMO) FLC can be written as follows:


$$R^i_{MIMO}: \text{IF } x_1 \text{ is } \tilde{F}_1^i \text{ and } \ldots \text{ and } x_p \text{ is } \tilde{F}_p^i \text{ THEN } y_1 \text{ is } \tilde{G}_1^i, \ldots, y_c \text{ is } \tilde{G}_c^i, \quad i = 1, \ldots, M \tag{21.13}$$


where M is the number of rules in the rule base. 
21.3.3 Fuzzy Inference Engine


The inference engine combines rules and gives a mapping from input type-2 sets to output type-2 sets.
In the inference engine, multiple antecedents in the rules are connected using the meet operation, the
membership grades in the input sets are combined with those in the output sets using the extended
sup-star composition, and multiple rules are combined using the join operation. In our interval type-2 FLC, we
will use the meet under product t-norm, so the result of the input and antecedent operations, which are
contained in the firing set $\sqcap_{j=1}^{p} \mu_{\tilde{F}_j^i}(x_j') \equiv F^i(x')$, is an interval type-1 set, as follows [Mendel 2001]:


$$F^i(x') = [\underline{f}^i(x'), \overline{f}^i(x')] \equiv [\underline{f}^i, \overline{f}^i] \tag{21.14}$$

where $\underline{f}^i(x')$ and $\overline{f}^i(x')$ can be written as follows, where $*$ denotes the product or the minimum
operations,

$$\underline{f}^i(x') = \underline{\mu}_{\tilde{F}_1^i}(x_1') * \cdots * \underline{\mu}_{\tilde{F}_p^i}(x_p') \tag{21.15}$$

$$\overline{f}^i(x') = \overline{\mu}_{\tilde{F}_1^i}(x_1') * \cdots * \overline{\mu}_{\tilde{F}_p^i}(x_p') \tag{21.16}$$

21.3.4 Type Reduction 


Type reduction is the operation that takes us from the type-2 output sets of the inference engine to a
type-1 set that is called the “type-reduced set.” These type-reduced sets are then defuzzified to obtain
crisp outputs that are sent to the outputs of the FLC.

The calculation of the type-reduced sets is divided into two stages; the first stage is the calculation 
of centroids of the type-2 interval consequent sets of each rule, which is conducted ahead of time and 
before starting the FLC operation. For each output, we determine the centroids of all the output type-2 
interval fuzzy sets representing this output, then the centroid of the type-2 interval consequent set for 
the ith rule will be one of the pre-calculated centroids of type-2 output sets, which corresponds to the 
rule consequent. To calculate the centroids of the output interval type-2 fuzzy sets, we used the KM 
iterative procedure as explained in [Mendel 2001]. 

The second stage of type reduction happens each control cycle to calculate the type-reduced sets. For
any output k, in order to compute the type-reduced set, we need to compute its two end points $y_{lk}$ and $y_{rk}$,
which can be found using the KM iterative procedures [Mendel 2001, Liang 2000].
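
For reference, the two end points returned by the KM procedure can be written in the following standard form. It is stated here under the assumption that the rule-consequent centroids $y_{lk}^i$ and $y_{rk}^i$ have been sorted in ascending order; $L$ and $R$ denote the switching points found by the iteration, and $\underline{f}^i$, $\overline{f}^i$ are the firing-interval end points of Equations 21.15 and 21.16.

```latex
y_{lk} = \frac{\sum_{u=1}^{L} \overline{f}^{u}\, y_{lk}^{u} + \sum_{v=L+1}^{M} \underline{f}^{v}\, y_{lk}^{v}}
              {\sum_{u=1}^{L} \overline{f}^{u} + \sum_{v=L+1}^{M} \underline{f}^{v}},
\qquad
y_{rk} = \frac{\sum_{u=1}^{R} \underline{f}^{u}\, y_{rk}^{u} + \sum_{v=R+1}^{M} \overline{f}^{v}\, y_{rk}^{v}}
              {\sum_{u=1}^{R} \underline{f}^{u} + \sum_{v=R+1}^{M} \overline{f}^{v}}
```

The left end point weights the small centroids with the upper firing strengths and the large ones with the lower firing strengths, which pulls the weighted average as far left as the firing intervals allow; the right end point does the opposite.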


21.3.5 Defuzzification 


From the type-reduction stage, we have for each output a type-reduced set $Y_{cos}(x)_k$, determined by its
left-most point $y_{lk}$ and right-most point $y_{rk}$. We defuzzify the interval set by using the average of $y_{lk}$ and
$y_{rk}$; hence the defuzzified crisp output for each output k is


$$Y_k(x) = \frac{y_{lk} + y_{rk}}{2} \tag{21.17}$$


21.4 Illustrative Example to Summarize the Operation 
of the Type-2 FLC 


In this section, we summarize the operation of the type-2 FLC through an example of a type-2 FLC
that realizes the right-edge following behavior for an outdoor robot. The objective of this behavior is
to follow an edge to the right of the robot at a desired distance. This type-2 FLC will have two inputs
from the two right-side sonar sensors: the first input is from the right-side front sensor (RSF) and the
second input is from the right-side back sensor (RSB). The RSF crisp input will be denoted by $x_1$ and
the RSB crisp input will be denoted by $x_2$. The type-2 FLC controls two outputs, which are the robot
speed denoted by $y_1$ and the robot steering denoted by $y_2$. Each input will be represented only by two
type-2 fuzzy sets, which are Near and Far as shown in Figure 21.5. The output robot speed will be
represented by three type-2 fuzzy sets, which are Slow, Medium, and Fast, while the robot steering will be
represented by two type-2 fuzzy sets, which are Left and Right. In what follows, we will follow a crisp
input vector through the various components of the type-2 FLC until we get crisp output signals to
the robot actuators. The crisp input vector will consist of two crisp inputs: the first one represents the
reading of RSF, which we term $x_1'$, and the second input represents the reading of RSB, which we term
$x_2'$. The type-2 FLC crisp outputs corresponding to $x_1'$ and $x_2'$ are $y_1'$ for the robot speed and $y_2'$ for the
robot steering.

FIGURE 21.5 Pictorial description of the input and antecedent operations for a robot type-2 FLC.


21.4.1 Fuzzification 


Figure 21.5 shows a pictorial description of the input and antecedent operations in Equations 21.14
through 21.16 in our type-2 FLC. In the fuzzification stage, as we are using singleton fuzzification, each
input is matched against its MFs to calculate the upper and lower membership values for each fuzzy set.
The input $x_1'$ of the RSF is matched against its MF in Figure 21.5a and it was found that the lower
membership value for the Near type-2 fuzzy set is 0.35 while the upper membership value is 0.85. For the Far
fuzzy set, the lower membership value is 0.15 while the upper membership value is 0.75. The input $x_2'$ of
the RSB is matched against its MF in Figure 21.5b and it was found that the lower membership value for
the Near fuzzy set is 0.1 and the upper membership value is 0.7. For the Far fuzzy set, the lower
membership value is 0.3 while the upper membership value is 0.8.
21.4.2 Rule Base


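The rule base for this type-2 FLC is shown in Table 21.1, where p (the number of inputs) is 2 and c (the
number of outputs) is 2. Any MIMO rule from Table 21.1 can be written according to Equation 21.13; for
example, rule (1) can be written as follows:


$R^1_{MIMO}$: IF $x_1$ is $\tilde{F}_1^1$ and $x_2$ is $\tilde{F}_2^1$ THEN $y_1$ is $\tilde{G}_1^1$, $y_2$ is $\tilde{G}_2^1$


where
$\tilde{F}_1^1$ is the Near type-2 fuzzy set for RSF
$\tilde{F}_2^1$ is the Near type-2 fuzzy set for RSB
$\tilde{G}_1^1$ is the Slow output type-2 fuzzy set for the robot speed
$\tilde{G}_2^1$ is the Left output type-2 fuzzy set for the robot steering


In the fuzzy inference engine, we need to calculate the firing strength of each rule. According to
Equation 21.14, the firing strength of each rule is an interval type-1 set $[\underline{f}^i, \overline{f}^i]$, where $\underline{f}^i$ is calculated
according to Equation 21.15 and $\overline{f}^i$ is calculated according to Equation 21.16. Note that in our FLC, we
use the meet under the product t-norm. So for rule (1), we can calculate $\underline{f}^1$ and $\overline{f}^1$ as follows:


$$\underline{f}^1 = \underline{\mu}_{\tilde{F}_1^1}(x_1') \cdot \underline{\mu}_{\tilde{F}_2^1}(x_2') = 0.35 \times 0.1 = 0.035$$


where $\underline{\mu}_{\tilde{F}_1^1}(x_1')$ is the lower membership value of $x_1'$ for the Near type-2 fuzzy set for RSF, which is 0.35,
and $\underline{\mu}_{\tilde{F}_2^1}(x_2')$ is the lower membership value of $x_2'$ for the Near type-2 fuzzy set for RSB, which is 0.1; these
membership values were calculated before in the fuzzification stage.
In a similar manner, by using the upper membership values, we can calculate $\overline{f}^1$ as follows:


$$\overline{f}^1 = \overline{\mu}_{\tilde{F}_1^1}(x_1') \cdot \overline{\mu}_{\tilde{F}_2^1}(x_2') = 0.85 \times 0.7 = 0.595$$

In a similar manner, we can calculate the rest of the firing strengths for all the rules as follows:

$$\underline{f}^2 = \underline{\mu}_{\tilde{F}_1^2}(x_1') \cdot \underline{\mu}_{\tilde{F}_2^2}(x_2') = 0.35 \times 0.3 = 0.105, \quad \overline{f}^2 = \overline{\mu}_{\tilde{F}_1^2}(x_1') \cdot \overline{\mu}_{\tilde{F}_2^2}(x_2') = 0.85 \times 0.8 = 0.68$$

$$\underline{f}^3 = \underline{\mu}_{\tilde{F}_1^3}(x_1') \cdot \underline{\mu}_{\tilde{F}_2^3}(x_2') = 0.15 \times 0.1 = 0.015, \quad \overline{f}^3 = \overline{\mu}_{\tilde{F}_1^3}(x_1') \cdot \overline{\mu}_{\tilde{F}_2^3}(x_2') = 0.75 \times 0.7 = 0.525$$

$$\underline{f}^4 = \underline{\mu}_{\tilde{F}_1^4}(x_1') \cdot \underline{\mu}_{\tilde{F}_2^4}(x_2') = 0.15 \times 0.3 = 0.045, \quad \overline{f}^4 = \overline{\mu}_{\tilde{F}_1^4}(x_1') \cdot \overline{\mu}_{\tilde{F}_2^4}(x_2') = 0.75 \times 0.8 = 0.6$$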
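
The firing-interval arithmetic above is easy to check mechanically. The short Python sketch below recomputes all four intervals from the membership values obtained in the fuzzification stage and the rule antecedents of Table 21.1, using the product t-norm; the dictionary layout and the name `firing_interval` are illustrative assumptions.

```python
# (lower, upper) membership grades from the fuzzification stage.
rsf = {"Near": (0.35, 0.85), "Far": (0.15, 0.75)}   # right-side front sensor
rsb = {"Near": (0.10, 0.70), "Far": (0.30, 0.80)}   # right-side back sensor

# Antecedents of rules 1 to 4 in Table 21.1: (RSF label, RSB label).
rules = [("Near", "Near"), ("Near", "Far"), ("Far", "Near"), ("Far", "Far")]

def firing_interval(front_label, back_label):
    """Meet under the product t-norm applied to the lower and upper grades."""
    lower = rsf[front_label][0] * rsb[back_label][0]
    upper = rsf[front_label][1] * rsb[back_label][1]
    return lower, upper

firing = [firing_interval(front, back) for front, back in rules]
for number, (lower, upper) in enumerate(firing, start=1):
    print(f"rule {number}: [{lower:.3f}, {upper:.3f}]")
```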


TABLE 21.1 Example Rule Base of a Right-Edge Following Behavior Implemented by a Type-2 FLC

Rule Number   RSF    RSB    Speed    Steering
1             Near   Near   Slow     Left
2             Near   Far    Slow     Left
3             Far    Near   Medium   Right
4             Far    Far    Fast     Right


© 2011 by Taylor and Francis Group, LLC 




21.4.3 Type Reduction 


21.4.3.1 Calculating the Centroids of the Rule Consequents 


In this stage, we need to calculate for each output the centroids of all the output type-2 fuzzy sets, so
that we can calculate the centroid of the consequent of each rule, which will be one of the centroids of
the output fuzzy sets that corresponds to the rule consequent. To calculate these centroids, we will use
the iterative KM procedure explained in [Mendel 2001, Hagras 2004] using 100 sampling points. For
each output, we calculate the centroids of all the type-2 fuzzy sets $y_k^t$, $t = 1, \ldots, T$. For the speed output, the
number of output fuzzy sets is T = 3. Assume for illustrative purposes that the centroid of the Slow output
fuzzy set is [0.43, 0.55], the centroid of the Medium output fuzzy set is [0.63, 0.76], and the centroid of the
Fast output fuzzy set is [1.03, 1.58]. Next, we can determine the centroids of the rule consequents of the
output speed $y_1^i$ as follows:


$$y_1^1 = [y_{l1}^1, y_{r1}^1] = y_1^2 = [y_{l1}^2, y_{r1}^2] = [0.43, 0.55]$$


$$y_1^3 = [y_{l1}^3, y_{r1}^3] = [0.63, 0.76], \quad y_1^4 = [y_{l1}^4, y_{r1}^4] = [1.03, 1.58]$$


For the output steering, the number of output fuzzy sets is T = 2. The steering values are in percentage,
where right steering values are positive and left steering values are negative. Again for illustrative
purposes, assume that the centroid for the Left output fuzzy set is [-85.4, -56.8] and the centroid for the
Right output fuzzy set is [56.8, 85.4]. Next, we can determine the centroids of the rule consequents $y_2^i$
as follows:


$$y_2^1 = [y_{l2}^1, y_{r2}^1] = y_2^2 = [y_{l2}^2, y_{r2}^2] = [-85.4, -56.8], \quad y_2^3 = [y_{l2}^3, y_{r2}^3] = y_2^4 = [y_{l2}^4, y_{r2}^4] = [56.8, 85.4]$$


21.4.3.2 Calculating the Type-Reduced Set 


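For each output k, to compute the type-reduced set $Y_{cos}(x)_k$, we need to compute its two end points $y_{lk}$ and $y_{rk}$.
Using the iterative KM procedure for type reduction explained in [Mendel 2001, Liang 2000], we can
determine the switching points L and R needed to calculate the type-reduced sets. For the speed output, we
do not need to reorder $y_{l1}^i$ as they are already ordered in ascending order, where
$y_{l1}^1 \le y_{l1}^2 \le y_{l1}^3 \le y_{l1}^4$; the same applies for $y_{r1}^i$. By using the iterative procedure in Figure 21.5,
it was found that L = 2, so $y_{l1}$ can be found as follows:


$$y_{l1} = \frac{\sum_{u=1}^{2} \overline{f}^u y_{l1}^u + \sum_{v=3}^{4} \underline{f}^v y_{l1}^v}{\sum_{u=1}^{2} \overline{f}^u + \sum_{v=3}^{4} \underline{f}^v} = \frac{\overline{f}^1 y_{l1}^1 + \overline{f}^2 y_{l1}^2 + \underline{f}^3 y_{l1}^3 + \underline{f}^4 y_{l1}^4}{\overline{f}^1 + \overline{f}^2 + \underline{f}^3 + \underline{f}^4}$$

$$= \frac{0.595 \times 0.43 + 0.68 \times 0.43 + 0.015 \times 0.63 + 0.045 \times 1.03}{0.595 + 0.68 + 0.015 + 0.045} = 0.452$$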
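To calculate $y_{r1}$, we use the iterative KM procedure and it was found that R = 3, so $y_{r1}$ can be found by
substituting in the following equation:


$$y_{r1} = \frac{\sum_{u=1}^{3} \underline{f}^u y_{r1}^u + \sum_{v=4}^{4} \overline{f}^v y_{r1}^v}{\sum_{u=1}^{3} \underline{f}^u + \sum_{v=4}^{4} \overline{f}^v} = \frac{\underline{f}^1 y_{r1}^1 + \underline{f}^2 y_{r1}^2 + \underline{f}^3 y_{r1}^3 + \overline{f}^4 y_{r1}^4}{\underline{f}^1 + \underline{f}^2 + \underline{f}^3 + \overline{f}^4}$$

$$= \frac{0.035 \times 0.55 + 0.105 \times 0.55 + 0.015 \times 0.76 + 0.6 \times 1.58}{0.035 + 0.105 + 0.015 + 0.6} = 1.37$$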


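For the steering output, we do not need to reorder $y_{l2}^i$ as they are already ordered in ascending order,
where $y_{l2}^1 \le y_{l2}^2 \le y_{l2}^3 \le y_{l2}^4$; the same applies for $y_{r2}^i$. By using the iterative procedure, it was found that
L = 2, so $y_{l2}$ can be found as follows:


$$y_{l2} = \frac{\sum_{u=1}^{2} \overline{f}^u y_{l2}^u + \sum_{v=3}^{4} \underline{f}^v y_{l2}^v}{\sum_{u=1}^{2} \overline{f}^u + \sum_{v=3}^{4} \underline{f}^v} = \frac{\overline{f}^1 y_{l2}^1 + \overline{f}^2 y_{l2}^2 + \underline{f}^3 y_{l2}^3 + \underline{f}^4 y_{l2}^4}{\overline{f}^1 + \overline{f}^2 + \underline{f}^3 + \underline{f}^4}$$

$$= \frac{0.595 \times (-85.4) + 0.68 \times (-85.4) + 0.015 \times 56.8 + 0.045 \times 56.8}{0.595 + 0.68 + 0.015 + 0.045} = -79.01$$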


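To calculate $y_{r2}$, we use the KM procedure and it was found that R = 2, so $y_{r2}$ can be found as follows:


$$y_{r2} = \frac{\sum_{u=1}^{2} \underline{f}^u y_{r2}^u + \sum_{v=3}^{4} \overline{f}^v y_{r2}^v}{\sum_{u=1}^{2} \underline{f}^u + \sum_{v=3}^{4} \overline{f}^v} = \frac{\underline{f}^1 y_{r2}^1 + \underline{f}^2 y_{r2}^2 + \overline{f}^3 y_{r2}^3 + \overline{f}^4 y_{r2}^4}{\underline{f}^1 + \underline{f}^2 + \overline{f}^3 + \overline{f}^4}$$

$$= \frac{0.035 \times (-56.8) + 0.105 \times (-56.8) + 0.525 \times 85.4 + 0.6 \times 85.4}{0.035 + 0.105 + 0.525 + 0.6} = 69.66$$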
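
The four end points above can be reproduced with a compact implementation of the KM iteration. The Python sketch below is one common way of writing the procedure, given under stated assumptions (mid-point initialization, at least two rules, strictly positive firing strengths); it is not taken from the cited references, and the function name `km_endpoint` is hypothetical. The inputs are the firing intervals and consequent centroids of the worked example.

```python
def km_endpoint(centroids, f_lower, f_upper, left=True, max_iter=100, tol=1e-12):
    """Karnik-Mendel iteration for one end point of the type-reduced set.

    centroids: y_l^i (left=True) or y_r^i (left=False) of the rule consequents.
    Returns (end point, switching point L or R counted from 1).
    """
    order = sorted(range(len(centroids)), key=lambda i: centroids[i])
    y = [centroids[i] for i in order]
    lo = [f_lower[i] for i in order]
    hi = [f_upper[i] for i in order]

    def weighted_average(weights):
        return sum(w * v for w, v in zip(weights, y)) / sum(weights)

    # Start from the mid-point of every firing interval.
    estimate = weighted_average([(a + b) / 2 for a, b in zip(lo, hi)])
    switch = len(y) - 1
    for _ in range(max_iter):
        # Switching point k satisfies y[k-1] <= estimate <= y[k] (0-based list).
        switch = next((k for k in range(1, len(y)) if y[k - 1] <= estimate <= y[k]),
                      len(y) - 1)
        if left:   # upper firing strengths up to the switch, lower after it
            weights = hi[:switch] + lo[switch:]
        else:      # lower firing strengths up to the switch, upper after it
            weights = lo[:switch] + hi[switch:]
        updated = weighted_average(weights)
        if abs(updated - estimate) < tol:
            break
        estimate = updated
    return estimate, switch

f_lower = [0.035, 0.105, 0.015, 0.045]
f_upper = [0.595, 0.68, 0.525, 0.6]

y_l1, L1 = km_endpoint([0.43, 0.43, 0.63, 1.03], f_lower, f_upper, left=True)
y_r1, R1 = km_endpoint([0.55, 0.55, 0.76, 1.58], f_lower, f_upper, left=False)
y_l2, L2 = km_endpoint([-85.4, -85.4, 56.8, 56.8], f_lower, f_upper, left=True)
y_r2, R2 = km_endpoint([-56.8, -56.8, 85.4, 85.4], f_lower, f_upper, left=False)
print(y_l1, L1, y_r1, R1, y_l2, L2, y_r2, R2)
```

For these inputs the iteration settles after two passes on each end point, and the switching points agree with the values of L and R quoted in the text.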


21.5 Defuzzification 


From the type-reduction stage, we have for each output k a type-reduced set; we defuzzify the interval
set by calculating the average of $y_{lk}$ and $y_{rk}$ using Equation 21.17 for both outputs as follows:


The speed crisp output $y_1' = \frac{y_{l1} + y_{r1}}{2} = \frac{0.452 + 1.37}{2} = 0.911$ m/s


And the steering crisp output $y_2' = \frac{y_{l2} + y_{r2}}{2} = \frac{-79.01 + 69.66}{2} = -4.675\%$
21.6 Conclusions and Future Directions


In this chapter, we presented an introduction to, and a brief overview of, the interval type-2 FLC and we
highlighted its benefits, especially in highly uncertain environments. There has been recent work to avoid the
computational overheads of the interval type-2 FLC and thus speed up its response to achieve satisfactory
real-time performance. More information about these techniques can be found in [Hagras 2008].

It has been shown in various applications that as the level of imprecision and uncertainty increases, 
the type-2 FLC will provide a powerful paradigm to handle the high level of uncertainties present in 
real-world environments [Hagras 2004, 2007a, 2008; Lynch 2006; Melin 2003; Shu 2005; Wu 2004; 
Figueroa 2005]. It has also been shown in various applications that type-2 FLCs have given very good
and smooth responses that have consistently outperformed their type-1 counterparts in these applications.
Thus, using a type-2 FLC in real-world applications can be a better choice than type-1 FLCs since the
amount of uncertainty in real systems is, most of the time, difficult to estimate.

Current research has started to explore the general type-2 FLC. Recent research is looking at generat- 
ing general type-2 FLCs that embed a group of interval type-2 FLCs. This will enable building on the 
existing theory of interval type-2 FLC while exploring the power of general type-2 FLCs. 

Thus, with the latest developments in interval type-2 FLCs, we can see that the type-2 FLC overcomes
the limitations of type-1 FLCs and presents a way forward for fuzzy control, especially in highly
uncertain environments, which include most real-world applications. Hence, type-2 FLCs are envisaged
to spread widely across many real-world applications in the next decade.
22.1 Introduction


Fuzzy pattern recognition emerged almost at the same time as fuzzy sets came into existence. One of the first
papers, authored by Bellman, Kalaba, and Zadeh [BKZ66], succinctly highlighted the key aspects of the
technology of fuzzy sets being cast in the setting of pattern recognition. Since then we have witnessed
a great deal of development, with a number of comprehensive review studies and books [P90,K82].
Recent years have seen a plethora of studies in all areas of pattern recognition, including methodology,
algorithms, and case studies [T73,R78,D82,LMT02].

Our key objective is to discuss the main aspects of the conceptual framework and algorithmic under- 
pinnings of fuzzy pattern recognition. The key paradigm of pattern recognition becomes substantially 
augmented by the principles of fuzzy sets. There are new developments that go far beyond the traditional 
techniques of pattern recognition and bring forward novel concepts and architectures that have not 
been contemplated so far. 

It is assumed that the reader is familiar with the basic ideas of pattern recognition and fuzzy sets; 
one may consult a number of authoritative references in these areas [DHS01,F90,PG07,H99,B81]. As 
a matter of fact, one can position the study in a more comprehensive setting of Granular Computing 
[BP03a,PB02] in which fuzzy sets are just instances of information granules. A number of results in 
fuzzy pattern recognition can be extended to the granular pattern recognition; throughout the text we 
will be making some pertinent observations with this regard. 


Given the breadth of the area of fuzzy pattern recognition, it is impossible to cover all of its essentials.
This study serves as an introduction to the area with several main objectives: We intend to demonstrate 
the main aspects of the technology of fuzzy sets (or Granular Computing) in the context of the para- 
digm of pattern recognition. With this regard, it is of interest to revisit the fundamentals and see how 
fuzzy sets enhance them at the conceptual, methodological, and algorithmic level and what area of 
applications could benefit the most from the incorporation of the technology of fuzzy sets. 

The presentation follows a top-down approach. We start with some methodological aspects of fuzzy 
sets that are of paramount relevance in the context of pattern recognition. Section 22.3 is devoted to 
information granulation, information granules, and Granular Computing where we elaborate on the 
concept of abstraction and its role in information processing. In the sequel, Section 22.4 is focused on 
supervised learning with fuzzy sets by showing how several main categories of classifiers are constructed 
by taking into consideration granular information. Unsupervised learning (clustering) is discussed 
afterwards and here we show that fuzzy sets play a dominant role given the unsupervised character of 
the learning processes. Fuzzy sets offer an interesting option of quantifying available domain knowl- 
edge, giving rise to an idea of partial supervision or knowledge-based clustering. Selected ideas of data 
and feature reduction are presented in Section 22.6. 

As far as the notation is concerned, we follow the symbols in common usage. Patterns
$x_1, x_2, \ldots, x_N$ are treated as vectors in n-dimensional space $\mathbf{R}^n$; $\|\cdot\|$ is used to describe a distance
(Euclidean, Mahalanobis, Hamming, Tchebyshev, etc.). Fuzzy sets will be described by capital letters; the same
notation is used for their membership functions. Class labels will be denoted by $\omega_1, \omega_2, \ldots$, etc., while
sets of integers will be described as $\mathbf{K} = \{1, 2, \ldots, K\}$, $\mathbf{N} = \{1, 2, \ldots, N\}$.


22.2 Methodology of Fuzzy Sets in Pattern Recognition 


The concept of fuzzy sets augments the principles of pattern recognition in several ways. The well- 
established techniques are revisited and their conceptual and algorithmic aspects are extended. Let us 
briefly highlight the main arguments which also trigger some intensive research pursuits and exhibit 
several far reaching consequences from the applied perspective. 

The leitmotiv is that fuzzy sets help realize user-centricity of pattern recognition schemes. Fuzzy sets 
are information granules of well-defined semantics which form a vocabulary of basic conceptual entities 
using which the problems are being formalized, models built, and decisions articulated. By expressing 
a certain most suitable point of view at the problem at hand and promoting a certain level of specificity, 
fuzzy sets form an effective conceptual framework for pattern recognition. There are two essential facets 
of the overall aspects of the nature of user centricity: 


Class memberships are membership grades. This quantification is of interest as there could be patterns
whose allocation to classes might not be completely described in a Boolean (yes-no) manner. The user
is more comfortable talking about levels of membership of particular patterns. It is also more
instrumental to generate classification results that present intermediate membership values.
There is an associated flagging effect: membership values in the vicinity of 0.5 indicate a need to
analyze the classification results further or to engage some other classifiers, either to gather evidence in
favor of belongingness to the given class (which is quantified through higher values of the membership
functions) or to collect evidence that justifies a reduction of such membership degrees.


Fuzzy sets contribute to the specialized, user-centric feature space. The original feature space is
transformed via fuzzy sets and produces a new feature space that is easier to understand and enhances the
effectiveness of the classifiers formed at the next phase. Similarly, through the use of fuzzy sets, one could
achieve a reduction of dimensionality of the original feature space. The nonlinearity effect introduced by
fuzzy sets could be instrumental in reducing the learning time of classifiers and enhancing their
discriminative properties as well as improving their robustness. The tangible advantage results from the
nonlinear character of membership functions. A properly adjusted nonlinearity could move apart patterns
belonging to different classes and bring closer those regions in which the patterns belong to the same
category. For instance, patterns belonging to two classes and distributed uniformly in a one-dimensional
space (see Figure 22.1) become well separated when transformed through a sigmoid membership
function A and described in terms of the corresponding membership grades. In essence, fuzzy sets play the
role of a nonlinear transformation of the feature space. We note that while the patterns are distributed
uniformly (Figure 22.1b, left), their distribution in the space of membership degrees [0, 1], u = A(x), shows
two groups of patterns that are located on the opposite ends of the unit interval with a large gap in between.

FIGURE 22.1 Nonlinear transformation realized by a sigmoid membership function: (a) sigmoid membership
function and (b) original patterns distributed uniformly in the feature space are grouped into quite distantly
positioned groups.

These two facets of fuzzy set-based user-centricity might be looked at together in the sense of an overall
interface layer of the core computing faculties of pattern recognition, as illustrated in Figure 22.2.
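
The separating effect of the sigmoid membership function discussed above can be checked numerically. The
following is only a minimal sketch: the sigmoid parameters (`m`, `alpha`), the ten sample points, and the
class split at x = 5 are assumptions made for illustration and are not taken from Figure 22.1.

```python
import math

def sigmoid_membership(x, m=5.0, alpha=2.0):
    """Sigmoid membership function A(x) = 1 / (1 + exp(-alpha * (x - m)))."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - m)))

# Patterns spread uniformly over a one-dimensional feature space (assumed data).
patterns = [0.5 + i for i in range(10)]           # 0.5, 1.5, ..., 9.5
grades = [sigmoid_membership(x) for x in patterns]

# Assumed class split: patterns below 5 form one class, those above 5 the other.
low = [u for x, u in zip(patterns, grades) if x < 5.0]
high = [u for x, u in zip(patterns, grades) if x > 5.0]

# Gap between the two classes before and after the transformation.
gap_original = min(x for x in patterns if x > 5.0) - max(x for x in patterns if x < 5.0)
gap_transformed = min(high) - max(low)

print("gap relative to the original range:", gap_original / (patterns[-1] - patterns[0]))
print("gap in the unit interval of grades:", gap_transformed)
```

Under these assumptions the gap between the classes, measured relative to the range of the space, is
noticeably larger in the space of membership grades than in the original feature space.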
Fuzzy pattern recognition dwells on the concepts of information granules and exploits their underlying
formalism. Information granules, giving rise to the general idea of Granular Computing, help offer
a sound level of abstraction at which classification problems need to be considered and address the issue of
complexity by reducing the level of detail to which pattern recognition tasks are exposed.


FIGURE 22.2 Fuzzy sets forming an interface layer (feature space and class membership) and wrapping the core
computational faculties of pattern recognition.


22.3 Information Granularity and Granular Computing 


Information granules permeate numerous human endeavors [BP03a,BP03b,Z97]. No matter what 
problem is taken into consideration, we usually express it in a certain conceptual framework of basic 
entities, which we regard to be of relevance to the problem formulation and problem solving. This 
becomes a framework in which we formulate generic concepts adhering to some level of abstraction, 
carry out processing, and communicate the results to the external environment. Consider, for instance, 
image processing. In spite of the continuous progress in the area, a human being assumes a dominant 
and very much uncontested position when it comes to understanding and interpreting images. Surely, 
we do not focus our attention on individual pixels and process them as such but group them together 
into semantically meaningful constructs—familiar objects we deal with in everyday life. Such objects 
involve regions that consist of pixels or categories of pixels drawn together because of their proximity 
in the image, similar texture, color, etc. This remarkable and unchallenged ability of humans dwells on 
our effortless ability to construct information granules, manipulate them, and arrive at sound conclu- 
sions. As another example, consider a collection of time series. From our perspective we can describe 
them in a semi-qualitative manner by pointing at specific regions of such signals. Specialists can effort- 
lessly interpret ECG signals. They distinguish some segments of such signals and interpret their com- 
binations. Experts can interpret temporal readings of sensors and assess the status of the monitored 
system. Again, in all these situations, the individual samples of the signals are not the focal point of the 
analysis and the ensuing signal interpretation. We always granulate all phenomena (no matter if they 
are originally discrete or analog in their nature). Time is another important variable that is subjected 
to granulation. We use seconds, minutes, days, months, and years. Depending upon a specific problem 
we have in mind and who the user is, the size of information granules (time intervals) could vary quite
dramatically. To high-level management, time intervals of quarters of a year or a few years could be
meaningful temporal information granules on the basis of which one develops a predictive model. For
those in charge of everyday operation of a dispatching plant, minutes and hours could form a viable
scale of time granulation. For the designer of high-speed integrated circuits and digital systems, the
temporal information granules concern nanoseconds and perhaps microseconds. Even
such commonly encountered and simple examples are convincing enough to lead us to ascertain that 
(a) information granules are the key components of knowledge representation and processing, (b) the 
level of granularity of information granules (their size, to be more descriptive) becomes crucial to the 
problem description and an overall strategy of problem solving, and (c) there is no universal level of 
granularity of information; the size of granules is problem-oriented and user dependent. 

What has been said so far touched a qualitative aspect of the problem. The challenge is to develop a 
computing framework within which all these representation and processing endeavors could be for- 
mally realized. The common platform emerging within this context comes under the name of Granular 
Computing. In essence, it is an emerging paradigm of information processing. While we have already 
noticed a number of important conceptual and computational constructs built in the domain of system 
modeling, machine learning, image processing, pattern recognition, and data compression in which 
various abstractions (and ensuing information granules) came into existence, Granular Computing 
becomes innovative and intellectually proactive in several fundamental ways: 


• It identifies the essential commonalities between the surprisingly diversified problems and
technologies used there which could be cast into a unified framework we usually refer to as a granular
world. This is a fully operational processing entity that interacts with the external world (which
could be another granular or numeric world) by collecting necessary granular information and
returning the outcomes of the granular computing.


© 2011 by Taylor and Francis Group, LLC 




• With the emergence of the unified framework of granular processing, we get a better grasp as to the
role of interaction between various formalisms and visualize a way in which they communicate.

• It brings together the existing formalisms of set theory (interval analysis) [M66,Z65,Z05,P82,P91,
PS07] under the same roof by clearly visualizing that in spite of their visibly distinct
underpinnings (and ensuing processing), they exhibit some fundamental commonalities. In this sense,
Granular Computing establishes a stimulating environment of synergy between the individual
approaches.

• By building upon the commonalities of the existing formal approaches, Granular Computing
helps build heterogeneous and multifaceted models of processing of information granules by
clearly recognizing the orthogonal nature of some of the existing and well-established frameworks
(say, probability theory coming with its probability density functions and fuzzy sets with their
membership functions).

• Granular Computing fully acknowledges a notion of variable granularity whose range could
cover detailed numeric entities and very abstract and general information granules. It looks at the
aspects of compatibility of such information granules and ensuing communication mechanisms
of the granular worlds.

• Interestingly, the inception of information granules is highly motivated. We do not form
information granules without reason. Information granules arise as an evident realization of the
fundamental paradigm of abstraction.


Granular Computing forms a unified conceptual and computing platform. Yet, it directly benefits from 
the already existing and well-established concepts of information granules formed in the setting of set 
theory, fuzzy sets, rough sets, and others. In the setting of this study it comes as a technology contribut- 
ing to pattern recognition. 


22.3.1 Algorithmic Aspects of Fuzzy Set Technology 
in Pattern Recognition: Pattern Classifiers 


Each of these challenges comes with its own suite of quite specific problems that require very
careful attention at both the conceptual and the algorithmic level. We have highlighted the list of
challenges, and in the remainder of this study we present some possible formulations of the associated
problems and look at their solutions. Needless to say, our proposal points in a direction that we
deem relevant but does not pretend to offer a complete solution to the problem. Some
algorithmic pursuits are also presented as an illustration of the possibilities emerging there.

Indisputably, the geometry of patterns belonging to different classes is a focal point that drives the overall
selection and design of pattern classifiers. Each classifier comes with its geometry, and this
predominantly determines its capabilities. While linear classifiers (built on the basis of some hyperplanes) and
nonlinear classifiers (such as neural networks) are two popular alternatives, there is another point of
view on the development of classifiers that dwells on the concept of information granules. Patterns
belonging to the same class form information granules in the feature space. A description of the geometry
of these information granules is our ultimate goal when designing effective classifiers.


22.4 Fuzzy Linear Classifiers and Fuzzy Nearest 
Neighbor Classifiers as Representatives 
of Supervised Fuzzy Classifiers 


Linear classifiers [DHS01] are governed by the well-known linear relationship $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$, where
$\mathbf{w}$ and $w_0$ are the parameters (weights and bias) of the classifier. The classification rule in case of two
classes ($\omega_1$ and $\omega_2$) reads as follows: classify $\mathbf{x}$ to $\omega_1$ if $y(\mathbf{x}) > 0$ and assign to $\omega_2$ otherwise. The design of
such a classifier (perceptron) has been intensively discussed in the literature and has resulted in a wealth
of algorithms. The property of linear separability of patterns assures us that the learning method
converges in a finite number of iterations. The classification rule does not quantify how close the pattern is
to the linear boundary, which could be regarded as a certain drawback of this classifier. The analogue of
the linear classifier expressed in the language of fuzzy sets brings about a collection of the parameters of
the classifier which are represented as fuzzy numbers, and triangular fuzzy numbers in particular. The
underlying formula of the fuzzy classifier comes in the form


$$Y(\mathbf{x}) = W_1 \otimes x_1 \oplus W_2 \otimes x_2 \oplus \cdots \oplus W_n \otimes x_n \oplus W_0 \tag{22.1}$$


where $W_i$, $i = 1, 2, \ldots, n$, are triangular fuzzy numbers. As a result, the output of the classifier is a fuzzy
number as well. Note that we used special symbols of addition and multiplication to underline the fact that
the computing is concerned with fuzzy numbers rather than plain numeric entities. A triangular fuzzy
number A can be represented as a triple $A = \langle a_-, a, a_+ \rangle$ with “a” being the modal value of A, and $a_-$ and
$a_+$ standing for the bounds of the membership function. The design criterion considered here stresses
the separability of the two classes in the sense of the membership degrees produced for a given pattern $\mathbf{x}$.


More specifically we have the following requirements: 


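$$\max\, Y(\mathbf{x}) \ \text{ if } \mathbf{x} \in \omega_1, \qquad \min\, Y(\mathbf{x}) \ \text{ if } \mathbf{x} \in \omega_2 \tag{22.2}$$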


As commonly encountered in fuzzy regression, one could base the crux of the design
on Linear Programming. In contrast to linear classifiers, fuzzy linear classifiers produce
classification results with class quantification; rather than a binary decision, we come
up with the degree of membership of a pattern to class $\omega_1$. The concept is illustrated
in Figure 22.3.
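
To make the arithmetic behind the fuzzy linear classifier more tangible, the sketch below evaluates its output
for a numeric pattern when the weights are triangular fuzzy numbers. It is an illustration only: the `Tri`
type, the helper names, and the numeric values are hypothetical, and the inputs are assumed to be crisp
numbers, in which case scaling and adding triangular numbers again yields a triangular number.

```python
from typing import NamedTuple, Sequence

class Tri(NamedTuple):
    """Triangular fuzzy number <a_-, a, a_+> (hypothetical helper type)."""
    lo: float    # lower bound a_-
    mode: float  # modal value a
    hi: float    # upper bound a_+

def scale(w: Tri, x: float) -> Tri:
    """Product of a triangular number and a crisp x; a negative x swaps the bounds."""
    a, b, c = w.lo * x, w.mode * x, w.hi * x
    return Tri(min(a, c), b, max(a, c))

def add(p: Tri, q: Tri) -> Tri:
    """Sum of two triangular numbers: bounds and modal values add up."""
    return Tri(p.lo + q.lo, p.mode + q.mode, p.hi + q.hi)

def fuzzy_linear_output(weights: Sequence[Tri], bias: Tri, x: Sequence[float]) -> Tri:
    """Fuzzy output of the classifier: bias plus the weighted inputs."""
    y = bias
    for w, xi in zip(weights, x):
        y = add(y, scale(w, xi))
    return y

# Assumed parameters, chosen only for illustration.
weights = [Tri(0.5, 1.0, 1.5), Tri(-1.5, -1.0, -0.5)]
bias = Tri(-0.5, 0.0, 0.5)
y = fuzzy_linear_output(weights, bias, [2.0, 1.0])
print(y)
```

The modal value of the result coincides with the output of an ordinary linear classifier built from the modal
values of the weights, while the bounds show how the imprecision of the parameters propagates to the output.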

In virtue of the classification rule, the membership of $Y(\mathbf{x})$ is highly asymmetric. The slope at one
side of the classification line is reflective of the distribution of patterns belonging to class $\omega_1$. The
geometry of the classifier is still associated with a linear boundary. What fuzzy sets offer is a fuzzy set
of membership associated with this boundary. Linear separability is an idealization of the
classification problem. In reality there could be some patterns located in the boundary region which do not
satisfy the linearity assumption. So even though the linear classifier comes as a viable alternative as a
first attempt, further refinement is required. The concept of the nearest neighbor (NN) classifier could
form a sound enhancement of the fuzzy linear classifier. The popularity of the NN classifiers stems from
the fact that in their development we rely on lazy learning, so no optimization effort is required at all.
Any new pattern is assigned to the same class as its closest neighbor. The underlying classification rule
reads as follows: given $\mathbf{x}$, determine $\mathbf{x}_{i_0}$ in the training set such that $i_0 = \arg\min_i \|\mathbf{x} - \mathbf{x}_i\|$; assuming that
the class membership of $\mathbf{x}_{i_0}$ is $\omega_j$, $\mathbf{x}$ is classified as $\omega_j$ as well. The NN classifier could involve a
single closest neighbor (in which case the classification rule is referred to as the 1-NN classifier), 3 neighbors
giving rise to 3-NN, 5 neighbors resulting in 5-NN, or k neighbors, where “k” is an odd number (k-NN
classifier). The majority vote implies the class membership of $\mathbf{x}$. The extension of the k-NN
classification rule can be realized in many ways. The intuitive one is to compute a degree of membership of $\mathbf{x}$ to
a certain class by looking at the closest “k” neighbors, determining the membership degrees $u_i(\mathbf{x})$, $i = 1,
2, \ldots, L$, and choosing the highest one as reflective of the allocation of $\mathbf{x}$ to the given class. Here $\mathrm{Card}(\Gamma) = L$
(Figure 22.4).

FIGURE 22.3 Fuzzy linear classifier; note the asymmetric nature of membership degrees generated around the
classification boundary.

FIGURE 22.4 Linear classifier with a collection of patterns in the boundary region ($\Gamma$) whose treatment is
handled by the NN classifier.
More specifically, we have

$$u_i(\mathbf{x}) = \frac{1}{\displaystyle\sum_{\mathbf{x}_j \in \Gamma} \left( \frac{\|\mathbf{x} - \mathbf{x}_i\|}{\|\mathbf{x} - \mathbf{x}_j\|} \right)^{2}} \tag{22.3}$$

(we will note a resemblance of this expression to the one describing membership degrees computed in
the FCM algorithm). Given that pattern $i_0$, where $i_0 = \arg\max_{i=1,2,\ldots,L} u_i(\mathbf{x})$, belongs to class $\omega_j$, we assign
$\mathbf{x}$ to the same class with the corresponding membership degree. The patterns positioned close to the
boundary of the linear classifier are engaged in the NN classification rule. The architecture illustrated
in Figure 22.5 comes as an aggregate of the fuzzy linear classifier and the fuzzy NN classifier, which
modifies the original membership degrees coming from the linear classifier by adjusting them on the
basis of some local characteristics of the data. The output of the NN classifier (generating the highest
membership degree) is governed by the expression $\mu = \max_{i=1,2,\ldots,L} u_i(\mathbf{x})$ and $i_0 = \arg\max_{i=1,2,\ldots,L} u_i(\mathbf{x})$. Let us
introduce the following indicator (characteristic) function:

$$\varphi(\mathbf{x}) = \begin{cases} 1 & \text{if } i_0 = 1 \ (\text{class } \omega_1) \\ 0 & \text{if } i_0 = 2 \ (\text{class } \omega_2) \end{cases} \tag{22.4}$$

FIGURE 22.5 Combination of two fuzzy classifiers: fuzzy linear classifier focused on the global nature of
classification is adjusted by the results formed by the fuzzy NN classifier acting on a local basis.


The aggregation of the classification results produced by the two classifiers is completed in the following
fashion, with the result $\Omega$ being a degree of membership to class $\omega_1$:

$$\Omega = \begin{cases} \min(1,\ Y(\mathbf{x}) + \mu\,\varphi(\mathbf{x})) & \text{if } \varphi(\mathbf{x}) = 1 \\ \max(0,\ Y(\mathbf{x}) - \mu\,(1 - \varphi(\mathbf{x}))) & \text{if } \varphi(\mathbf{x}) = 0 \end{cases} \tag{22.5}$$

Note that if the NN classifier has assigned $\mathbf{x}$ to class $\omega_1$, this class membership elevates the membership
degree produced by the fuzzy linear classifier; hence we arrive at the clipped sum $\min(1, Y(\mathbf{x}) + \mu)$. In
the opposite case, where the NN classification points at the assignment to the second class, the overall
class membership to $\omega_1$ becomes reduced by $\mu$ and is clipped at zero. As a result, given a subset of patterns $\Gamma$ in the boundary
region of the fuzzy linear classifier, the classification region is adjusted accordingly. Its overall geometry
is more complicated and nonlinear, adjusting the original linear form to the patterns located in this
boundary region.
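
The interplay of the two classifiers can be sketched in a few lines of code. The example below is a hedged
illustration rather than a reference implementation: the function names, the sample patterns, and the
handling of a zero distance are assumptions, and the adjustment is clipped so that the resulting membership
degree stays in the unit interval.

```python
import math

def fuzzy_nn_memberships(x, gamma):
    """Membership of x with respect to each pattern of gamma, following (22.3)."""
    d = [math.dist(x, xj) for xj in gamma]
    if 0.0 in d:
        # Assumed convention: x coincides with stored patterns, which share full membership.
        hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
        return [h / sum(hits) for h in hits]
    return [1.0 / sum((di / dj) ** 2 for dj in d) for di in d]

def aggregate(y, mu, phi):
    """Adjust the linear classifier's membership y in class omega_1, as in (22.5)."""
    if phi == 1:
        return min(1.0, y + mu)   # NN supports omega_1: raise the degree, clip at 1
    return max(0.0, y - mu)       # NN supports omega_2: lower the degree, clip at 0

# Assumed boundary-region patterns and their class labels (1 or 2).
gamma = [(1.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
labels = [1, 2, 2]
x = (0.0, 0.0)

u = fuzzy_nn_memberships(x, gamma)
i0 = max(range(len(u)), key=u.__getitem__)   # index of the highest membership
mu = u[i0]
phi = 1 if labels[i0] == 1 else 0            # indicator function (22.4)
omega = aggregate(0.5, mu, phi)              # 0.5: assumed output of the linear classifier
print(u, mu, phi, omega)
```

Note that the membership degrees produced by (22.3) sum to one over the patterns of the boundary set, which
is the property they share with the partition matrix of the FCM algorithm.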


22.4.1 Fuzzy Logic-Oriented Classifiers


Fuzzy sets and information granules, in general, offer a structural backbone of fuzzy classifiers. The 
crux of the concept is displayed in Figure 22.6. Information granules are formed in the feature space. 
They are logically associated with classes in the sense that for each class its degree of class membership 
is a logic expression of the activation levels (matching degrees) of the individual information granules. 
The flexibility of the logic mapping is offered through the use of the collection of logic neurons (fuzzy 
neurons) whose connections are optimized during the design of the classifier. 


22.4.2 Main Categories of Fuzzy Neurons 


There are two main types of logic neurons: aggregative and referential neurons. Each of them comes 
with a clearly defined semantics of its underlying logic expression and is equipped with significant para- 
metric flexibility necessary to facilitate substantial learning abilities. 


FIGURE 22.6 An overall scheme of logic mapping between information granules (fuzzy sets formed in the
feature space) and the class membership degrees.


22.4.2.1 Aggregative Neurons


Formally, these neurons realize a logic mapping from $[0, 1]^n$ to $[0, 1]$. Two main classes of the processing
units exist in this category [P95,PR93,HP93].

OR neuron: This realizes an and logic aggregation of inputs $\mathbf{x} = [x_1\ x_2\ \ldots\ x_n]$ with the corresponding
connections (weights) $\mathbf{w} = [w_1\ w_2\ \ldots\ w_n]$ and then summarizes the partial results in an or-wise manner
(hence the name of the neuron). The concise notation underlines this flow of computing, $y = \mathrm{OR}(\mathbf{x}; \mathbf{w})$,
while the realization of the logic operations gives rise to the expression (commonly referred to as an
s-t combination or s-t aggregation)


$$y = \mathop{S}_{i=1}^{n} (x_i \, t \, w_i) \tag{22.6}$$


Bearing in mind the interpretation of the logic connectives (t-norms and t-conorms), the OR neuron 
realizes the following logic expression being viewed as an underlying logic description of the processing 
of the input signals: 


$$(x_1 \text{ and } w_1) \text{ or } (x_2 \text{ and } w_2) \text{ or } \ldots \text{ or } (x_n \text{ and } w_n) \tag{22.7}$$


Apparently, the inputs are logically “weighted” by the values of the connections before producing the
final result. In other words, we can treat “y” as a truth value of the above statement where the truth
values of the inputs are affected by the corresponding weights. Noticeably, lower values of $w_i$ discount
the impact of the corresponding inputs; higher values of the connections (especially those
positioned close to 1) do not affect the original truth values of the inputs resulting in the logic formula.
In the limit, if all connections $w_i$, $i = 1, 2, \ldots, n$, are set to 1, then the neuron produces a plain or-combination
of the inputs, $y = x_1$ or $x_2$ or ... or $x_n$. The values of the connections set to zero eliminate the
corresponding inputs. Computationally, the OR neuron exhibits nonlinear characteristics, inherently
implied by the use of the t-norms and t-conorms (which are evidently nonlinear mappings). The
connections of the neuron contribute to its adaptive character; the changes in their values form the crux
of the parametric learning.

AND neuron: The neurons in this category, described as $y = \mathrm{AND}(\mathbf{x}; \mathbf{w})$ with $\mathbf{x}$ and $\mathbf{w}$ being defined as
in the case of the OR neuron, are governed by the expression


$$y = \mathop{T}_{i=1}^{n} (x_i \, s \, w_i) \tag{22.8}$$


Here the or and and connectives are used in a reversed order: first the inputs are combined with the use
of the t-conorm, and the partial results produced in this way are aggregated and-wise. Higher values of
the connections reduce the impact of the corresponding inputs. In the limit, $w_i = 1$ eliminates the relevance of $x_i$.
With all $w_i$ set to 0, the output of the AND neuron is just an and aggregation of the inputs


$$y = x_1 \text{ and } x_2 \text{ and } \ldots \text{ and } x_n \tag{22.9}$$


Let us conclude that the neurons are highly nonlinear processing units whose nonlinear mapping 
depends upon the specific realizations of the logic connectives. They also come with potential plasticity 
whose usage becomes critical when learning the networks including such neurons. 
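
The OR and AND neurons described above can be sketched as follows. This is a minimal illustration under an
explicit assumption: the t-norm is realized as the product and the t-conorm as the probabilistic sum, which
is only one of many admissible choices of logic connectives; the function names are hypothetical.

```python
from functools import reduce

def t_norm(a, b):
    """t-norm realized as the product (assumed choice)."""
    return a * b

def s_norm(a, b):
    """t-conorm realized as the probabilistic sum (assumed choice)."""
    return a + b - a * b

def or_neuron(x, w):
    """y = S_i (x_i t w_i): and-combine each input with its weight, then aggregate or-wise."""
    return reduce(s_norm, (t_norm(xi, wi) for xi, wi in zip(x, w)), 0.0)

def and_neuron(x, w):
    """y = T_i (x_i s w_i): or-combine each input with its weight, then aggregate and-wise."""
    return reduce(t_norm, (s_norm(xi, wi) for xi, wi in zip(x, w)), 1.0)

x = [0.2, 0.7]
print(or_neuron(x, [1.0, 1.0]))   # all weights 1: plain or-combination of the inputs
print(or_neuron(x, [0.0, 1.0]))   # weight 0 eliminates the first input
print(and_neuron(x, [0.0, 0.0]))  # all weights 0: plain and-combination of the inputs
print(and_neuron(x, [1.0, 0.0]))  # weight 1 eliminates the first input
```

The four calls reproduce the limit cases discussed in the text: weights equal to 1 leave the OR neuron as a
plain or-combination, whereas weights equal to 0 leave the AND neuron as a plain and-combination.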

At this point, it is worth contrasting these two categories of logic neurons with “standard” neurons
we encounter in neurocomputing. The typical construct there comes in the form of the weighted sum
of the inputs $x_1, x_2, \ldots, x_n$ with the corresponding connections (weights) $w_1, w_2, \ldots, w_n$, followed by
a nonlinear (usually monotonically increasing) function, which reads as follows:


$$y = g(\mathbf{w}^T\mathbf{x} + \tau) = g\left(\sum_{i=1}^{n} w_i x_i + \tau\right) \tag{22.10}$$


where
$\mathbf{w}$ is a vector of connections
$\tau$ is a constant term (bias)
“g” denotes some monotonically non-decreasing nonlinear mapping


The other less commonly encountered neuron is a so-called π-neuron. While there could be some
variations as to the parametric details of this construct, we can envision the following realization of the
neuron:


$$y = g\left(\prod_{i=1}^{n} |x_i - t_i|^{w_i}\right) \tag{22.11}$$


where
$\mathbf{t} = [t_1\ t_2\ \ldots\ t_n]$ denotes a vector of translations
$\mathbf{w}$ (> 0) denotes a vector of all connections

As before, the nonlinear function is denoted by “g.” While some superficial and quite loose analogy
between these processing units and logic neurons could be derived, one has to be cognizant that these
neurons do not come with any underlying logic fabric and hence cannot be easily and immediately
interpreted.

Let us make two observations about the architectural and functional facets of the logic neurons we 
have introduced so far. 


Incorporation of the bias term in the fuzzy logic neurons. In analogy to the standard constructs of
a generic neuron as presented above, we could also consider a bias term, denoted by $w_0 \in [0, 1]$, which
enters the processing formula of the fuzzy neuron in the following manner:


For the OR neuron

$$y = \mathop{S}_{i=1}^{n} (x_i \, t \, w_i) \, s \, w_0 \tag{22.12}$$

For the AND neuron

$$y = \mathop{T}_{i=1}^{n} (x_i \, s \, w_i) \, t \, w_0 \tag{22.13}$$


We can offer some useful interpretation of the bias by treating it as some nonzero initial truth value
associated with the logic expression of the neuron. For the OR neuron it means that the output does
not reach values lower than the assumed threshold. For the AND neuron equipped with some bias, we
conclude that its output cannot exceed the value assumed by the bias. The question whether the bias is
essential in the construct of the logic neurons cannot be fully answered in advance. Instead, we may
include it in the structure of the neuron and carry out learning. Once its value has been obtained, its
relevance could be established by considering the specific value produced during the learning.
It may well be that the optimized value of the bias is close to zero for the OR neuron or close to one in the
case of the AND neuron, which indicates that it could be eliminated without exhibiting any substantial
impact on the performance of the neuron.
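
The bounding role of the bias can be verified with a short sketch. As before, the product and the
probabilistic sum are assumed as the realizations of the t-norm and t-conorm, and the function names and
numeric values are hypothetical.

```python
from functools import reduce

def t_norm(a, b):
    """t-norm realized as the product (assumed choice)."""
    return a * b

def s_norm(a, b):
    """t-conorm realized as the probabilistic sum (assumed choice)."""
    return a + b - a * b

def or_neuron_bias(x, w, w0):
    """OR neuron with bias: the or-aggregation is further or-combined with w0."""
    core = reduce(s_norm, (t_norm(xi, wi) for xi, wi in zip(x, w)), 0.0)
    return s_norm(core, w0)

def and_neuron_bias(x, w, w0):
    """AND neuron with bias: the and-aggregation is further and-combined with w0."""
    core = reduce(t_norm, (s_norm(xi, wi) for xi, wi in zip(x, w)), 1.0)
    return t_norm(core, w0)

# Even with all inputs at 0, the OR neuron does not drop below its bias.
y_or = or_neuron_bias([0.0, 0.0], [0.5, 0.9], w0=0.4)
# Even with all inputs at 1, the AND neuron does not exceed its bias.
y_and = and_neuron_bias([1.0, 1.0], [0.5, 0.1], w0=0.6)
print(y_or, y_and)
```

A bias of 0 in the OR neuron, or of 1 in the AND neuron, leaves the output unchanged, which is why optimized
bias values close to these limits suggest that the bias can be dropped.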


Dealing with inhibitory character of input information. Owing to the monotonicity of the t-norms and
t-conorms, the computing realized by the neurons exhibits an excitatory character. This means that
higher values of the inputs ($x_i$) contribute to the increase in the values of the output of the neuron. The
inhibitory nature of computing realized by “standard” neurons by using negative values of the
connections or the inputs is not available here, as the truth values (membership grades) in fuzzy sets are
confined to the unit interval. The inhibitory nature of processing can be accomplished by considering the
complement of the original input, $\bar{x}_i = 1 - x_i$. Hence, when the values of $x_i$ increase, the associated values of
the complement decrease, and subsequently in this configuration we could effectively treat such an input
as having an inhibitory nature.


22.4.3 Architectures of Logic Networks 


The logic neurons (aggregative and referential) can serve as building blocks of more comprehensive and 
functionally appealing architectures. The diversity of the topologies one can construct with the aid of 
the proposed neurons is surprisingly high. This architectural diversity is important from the application 
point of view as we can fully reflect the nature of the problem in a flexible manner. It becomes essential 
to capture the underlying nature of the problem and set up a logic skeleton of the network along with 
an optimization of its parameters. Throughout the entire development process we are positioned quite 
comfortably by monitoring the optimization of the network as well as interpreting its semantics. 


22.4.3.1 Logic Processor in the Processing of Fuzzy Logic Functions: 
A Canonical Realization 


The typical logic network that is at the center of logic processing originates from two-valued logic
and comes in the form of the famous Shannon theorem of decomposition of Boolean functions. Let us
recall that any Boolean function $\{0, 1\}^n \to \{0, 1\}$ can be represented as a logic sum of its corresponding
minterms or a logic product of maxterms. By a minterm of “n” logic variables $x_1, x_2, \ldots, x_n$ we mean a
logic product involving all these variables either in direct or complemented form. Having “n” variables
we end up with $2^n$ minterms, starting from the one involving all complemented variables and ending up
at the logic product with all direct variables. Likewise, by a maxterm we mean a logic sum of all variables
or their complements. Now in virtue of the decomposition theorem, we note that the first
representation scheme involves a two-layer network where the first layer consists of AND gates whose outputs are
combined in a single OR gate. The converse topology occurs for the second decomposition mode: there
is a single layer of OR gates followed by a single AND gate aggregating and-wise all partial results.
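
The first decomposition mode (a logic sum of minterms) can be illustrated with a small sketch. The function
names and the three-input parity function used as the example are assumptions chosen for illustration.

```python
from itertools import product

def minterms_of(f, n):
    """List the input combinations (minterms) for which the Boolean function f returns 1."""
    return [bits for bits in product((0, 1), repeat=n) if f(*bits)]

def sum_of_minterms(minterms, bits):
    """Two-layer AND-OR evaluation: one AND gate per minterm, followed by a single OR gate."""
    # Each AND gate takes a variable directly (term bit 1) or complemented (term bit 0).
    and_layer = [all(b == m for b, m in zip(bits, term)) for term in minterms]
    return int(any(and_layer))

def parity(a, b, c):
    """Example Boolean function of three variables (assumed for illustration)."""
    return a ^ b ^ c

terms = minterms_of(parity, 3)
print(terms)
print([sum_of_minterms(terms, bits) for bits in product((0, 1), repeat=3)])
```

The two-layer network reproduces the original function for every input combination, using only those
minterms (out of the 2^n possible ones) on which the function assumes the value 1.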

The proposed network (referred to here as a logic processor) generalizes this concept, as shown in Figure
22.7. The OR-AND mode of the logic processor comes with the two types of aggregative neurons being
swapped between the layers. Here the first (hidden) layer is composed of OR neurons and is followed
by the output realized by means of the AND neuron.

FIGURE 22.7 A topology of the logic processor in its AND-OR mode (a hidden layer of AND neurons followed by
an OR neuron).

The logic neurons generalize digital gates. The design of the network (viz. any fuzzy function) is real- 
ized through learning. If we confine ourselves to Boolean {0,1} values, the network’s learning becomes 
an alternative to a standard digital design, especially a minimization of logic functions. The logic pro- 
cessor translates into a compound logic statement (we skip the connections of the neurons to underline 
the underlying logic content of the statement): 


• if (input$_1$ and ... and input$_j$) or (input$_d$ and ... and input$_f$) then class membership


The logic processor’s topology (and underlying interpretation) is standard. Two LPs can vary in terms of
the number of AND neurons and their connections, but the format of the resulting logic expression stays quite
uniform (as a sum of generalized minterms), and this introduces a significant level of interpretability.
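
A logic processor in its AND-OR mode can be sketched by stacking the neurons introduced earlier. The code
below is an illustration under stated assumptions: the product and probabilistic sum realize the logic
connectives, the connections are set by hand rather than learned, and the inputs are extended by their
complements so that the network can express an exclusive-or of two variables.

```python
from functools import reduce

def t_norm(a, b):
    """t-norm realized as the product (assumed choice)."""
    return a * b

def s_norm(a, b):
    """t-conorm realized as the probabilistic sum (assumed choice)."""
    return a + b - a * b

def and_neuron(x, w):
    """y = T_i (x_i s w_i); a weight of 1 eliminates the corresponding input."""
    return reduce(t_norm, (s_norm(xi, wi) for xi, wi in zip(x, w)), 1.0)

def or_neuron(x, w):
    """y = S_i (x_i t w_i); a weight of 0 eliminates the corresponding input."""
    return reduce(s_norm, (t_norm(xi, wi) for xi, wi in zip(x, w)), 0.0)

def logic_processor(x, hidden_w, out_w):
    """Hidden layer of AND neurons (generalized minterms) followed by one OR neuron."""
    z = [and_neuron(x, w) for w in hidden_w]
    return or_neuron(z, out_w)

def xor_membership(x1, x2):
    """Hand-set processor realizing (x1 and not x2) or (not x1 and x2)."""
    inputs = [x1, x2, 1.0 - x1, 1.0 - x2]       # direct and complemented inputs
    hidden_w = [[0.0, 1.0, 1.0, 0.0],           # keeps x1 and the complement of x2
                [1.0, 0.0, 0.0, 1.0]]           # keeps x2 and the complement of x1
    return logic_processor(inputs, hidden_w, [1.0, 1.0])

print([xor_membership(a, b) for a in (0.0, 1.0) for b in (0.0, 1.0)])
print(xor_membership(0.9, 0.2))                 # intermediate truth values are allowed
```

For Boolean inputs the processor behaves like the corresponding digital circuit, while for intermediate
truth values it returns a graded class membership.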


22.4.4 Granular Constructs of Classifiers 


The fundamental development strategy pursued when dealing with this category of classifiers dwells upon 
the synergistic and highly orchestrated usage of two fundamental technologies of Granular Computing, 
namely intervals (hyperboxes) and fuzzy sets. Given this, the resulting constructs will be referred to as
granular hyperbox-driven classifiers. The architecture of the hyperbox-driven classifier (HDC, for short) comes
with two well-delineated architectural components that directly imply its functionality. The core (primary) 
part of the classifier which captures the essence of the structure is realized in terms of interval analysis. 
Sets are the basic constructs that form the regions of the feature space where there is a high homogeneity 
of the patterns (which implies low classification error). We may refer to it as a core structure. Fuzzy sets are 
used to cope with the patterns outside the core structure and in this way contribute to a refinement of the 
already developed core structure. This type of the more detailed structure will be referred to as a secondary 
one. The two-level granular architecture of the classifier reflects a way in which classification processes are 
usually carried out: we start with a core structure where the classification error is practically absent and 
then consider the regions of high overlap between the classes where there is a high likelihood of the clas- 
sification error. For the core structure, the use of sets as generic information granules is highly legitimate: 
there is no need to distinguish between these elements of the feature space. The areas of high overlap require
more detailed treatment; hence there arises a genuine need to consider fuzzy sets as the suitable granular
constructs. The membership grades play an essential role in expressing levels of confidence associated with the
classification result. In this way, we bring a detailed insight into the geometry of the classification problem 
and identify regions of very poor classification. One can view the granular classifier as a two-level hierar- 
chical classification scheme whose development adheres to the principle of a stepwise refinement of the 
construct with sets forming the core of the architecture and fuzzy sets forming its specialized enhancement. 
Given this, a schematic view of the two-level construct of the granular classifier is included in Figure 22.8. 
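
The two-level idea can be sketched as follows. The code is not the hyperbox-driven classifier of the
references; it is a simplified illustration in which the membership function outside a hyperbox (a linear
decay with the largest per-coordinate distance, controlled by a hypothetical parameter `gamma`) and the
sample boxes are assumptions.

```python
def hyperbox_membership(x, lower, upper, gamma=1.0):
    """Membership 1 inside the box (core structure); linear decay outside (fuzzy refinement)."""
    # Per-coordinate distance to the interval [l, u]; it is 0 when the coordinate lies inside.
    dist = max(max(l - xi, 0.0, xi - u) for xi, l, u in zip(x, lower, upper))
    return max(0.0, 1.0 - gamma * dist)

def classify(x, boxes):
    """boxes: list of (label, lower, upper). Returns the best label and its membership grade."""
    scored = [(hyperbox_membership(x, lo, up), label) for label, lo, up in boxes]
    grade, label = max(scored)
    return label, grade

# Assumed hyperboxes for two classes in a two-dimensional feature space.
boxes = [("A", (0.0, 0.0), (1.0, 1.0)),
         ("B", (2.0, 0.0), (3.0, 1.0))]

print(classify((0.5, 0.5), boxes))   # inside the core of class A: full membership
print(classify((1.4, 0.5), boxes))   # between the boxes: graded membership
```

Patterns falling into a hyperbox are handled by the set-based core, whereas patterns outside all boxes
receive membership grades that quantify the confidence of the classification result.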

One of the first approaches to the construction of set-based classifiers (hyperboxes) was presented by
Simpson [S92,S93], both in supervised and unsupervised mode. Abe et al. [ATK98] presented an efficient
method for extracting rules directly from a series of activation hyperboxes, which capture the existence 
region of data for a given class and inhibition hyperboxes, which inhibit the existence of data of that 
class. Rizzi et al. [RMM00,RPM02] proposed an adaptive resolution classifier (ARC) and its pruned ver- 
sion (PARC) in order to enhance the constructs introduced by Simpson. ARC/PARC generates a regu- 
larized min-max network by a series of hyperbox cuts. Gabrys and Bargiela [GB00] described a general 
fuzzy min-max (GFMM) neural network which combines mechanisms of supervised and unsupervised 
learning into a single unified framework. 

The design of the granular classifiers offers several advantages over some “standard” pattern classifiers. 
First, the interpretability is highly enhanced: both the structure and the conceptual organization are appealing
in the way in which the topology of patterns is interpreted. Second, one can resort to
the existing learning schemes developed both for set-theoretic classifiers and fuzzy classifiers [PP08,PS05].
FIGURE 22.8 From sets to fuzzy sets: (a) a principle of a two-level granular classifier exploiting the successive
usage of the formalisms of information granulation and (b) further refinements of the information granules real- 
ized on a basis of the membership degrees. 


22.5 Unsupervised Learning with Fuzzy Sets 


Unsupervised learning, quite commonly treated as an equivalent of clustering, is the subdiscipline of
pattern recognition which is aimed at the discovery of structure in data and its representation in the 
form of clusters—groups of data. 

Clusters, by their very nature, are inherently fuzzy. Fuzzy sets constitute a natural vehicle to quantify
the strength of membership of patterns in a certain group. The example shown in Figure 22.9 clearly
demonstrates this need. The pattern positioned between the two well-structured and compact groups
exhibits some level of resemblance (membership) to each of the clusters. Surely enough, one would be
hesitant to allocate it fully to either of the clusters. Membership values such as 0.55 and 0.45
not only reflect the structure in the data but also flag the distinct nature of this pattern and may
trigger its further inspection. In this way, we note the user-centric character of fuzzy
sets, which makes interaction with users more effective and transparent.


22.5.1 Fuzzy C-Means as an Algorithmic Vehicle of Data 
Reduction through Fuzzy Clusters 


Fuzzy sets can be formed on the basis of numeric data through their clustering (grouping). The groups
of data give rise to membership functions that convey a global, more abstract, and general view of the
available data. In this regard, Fuzzy C-Means (FCM, for short) is one of the commonly used mechanisms
of fuzzy clustering [B81,P05].


FIGURE 22.9 Example of two-dimensional data with patterns of varying membership degrees to the two highly 
visible and compact clusters. 
Let us review its formulation, develop the algorithm, and highlight the main properties of the fuzzy
clusters. Given a collection of n-dimensional data $\{x_k\}$, $k = 1, 2, \ldots, N$, the task of determining its
structure (a collection of “c” clusters) is expressed as a minimization of the following objective function
(performance index) Q, regarded as a sum of the squared distances between the data and their
representatives (prototypes):

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \|x_k - v_i\|^2 \tag{22.14}$$


Here
$v_i$ are n-dimensional prototypes of the clusters, $i = 1, 2, \ldots, c$
$U = [u_{ik}]$ stands for a partition matrix expressing a way of allocation of the data to the corresponding
clusters
$u_{ik}$ is the membership degree of data $x_k$ in the ith cluster

The distance between the data $x_k$ and prototype $v_i$ is denoted by $\|\cdot\|$. The fuzzification coefficient m (>1.0)
expresses the impact of the membership grades on the individual clusters. It implies a certain geometry
of fuzzy sets, which will be presented later in this study.

A partition matrix satisfies two important and intuitively appealing properties:

$$0 < \sum_{k=1}^{N} u_{ik} < N, \quad i = 1, 2, \ldots, c \tag{22.15a}$$

$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \ldots, N \tag{22.15b}$$

Let us denote by $\mathcal{U}$ the family of matrices satisfying (22.15a) and (22.15b). The first requirement states
that each cluster has to be nonempty and different from the entire set. The second requirement states
that the membership grades of each data point, taken over all clusters, sum to 1.
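The two requirements can be checked mechanically. The following is a minimal sketch, assuming the partition matrix is stored as a list of c rows (one per cluster) with N entries each; the function name `is_partition_matrix` and the tolerance `tol` are illustrative choices, not part of the original text.

```python
def is_partition_matrix(U, tol=1e-9):
    """Check the requirements (22.15a) and (22.15b) for a c x N matrix U.

    U is assumed to be a list of c rows (clusters), each with N entries.
    """
    c, n = len(U), len(U[0])
    # Every membership grade must lie in the unit interval.
    in_range = all(0.0 <= u <= 1.0 for row in U for u in row)
    # (22.15a): each cluster is nonempty and differs from the entire set.
    rows_ok = all(0.0 < sum(row) < n for row in U)
    # (22.15b): the grades of each data point sum to 1 across clusters.
    cols_ok = all(abs(sum(U[i][k] for i in range(c)) - 1.0) <= tol
                  for k in range(n))
    return in_range and rows_ok and cols_ok


print(is_partition_matrix([[0.7, 0.2], [0.3, 0.8]]))
```

A matrix that assigns every data point fully to one cluster fails the check, because the other cluster is empty and the first coincides with the entire data set.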

The minimization of Q is completed with respect to $U \in \mathcal{U}$ and the prototypes $v_i$ of $V = \{v_1, v_2, \ldots, v_c\}$ of
the clusters. More explicitly, we write it down as follows:

$$\min Q \ \text{with respect to} \ U \in \mathcal{U}, \ v_1, v_2, \ldots, v_c \in \mathbb{R}^n \tag{22.16}$$


From the optimization standpoint, there are two individual optimization tasks to be carried out separately
for the partition matrix and the prototypes. The first one is a constrained minimization, given the
requirement of the form (22.15b), which holds for each data point $x_k$. The use
of Lagrange multipliers converts the problem into its constraint-free version. The augmented objective
function formulated for each data point, $k = 1, 2, \ldots, N$, reads as

$$V = \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2} + \lambda \left( \sum_{i=1}^{c} u_{ik} - 1 \right) \tag{22.17}$$

where $d_{ik}^{2} = \|x_k - v_i\|^2$.
It is instructive to go over the details of the optimization process. Starting with the necessary conditions
for the minimum of V for $k = 1, 2, \ldots, N$, one obtains

$$\frac{\partial V}{\partial u_{st}} = 0, \quad \frac{\partial V}{\partial \lambda} = 0 \tag{22.18}$$

© 2011 by Taylor and Francis Group, LLC 




Here $s = 1, 2, \ldots, c$ and $t = 1, 2, \ldots, N$. Now we calculate the derivative of V with respect to the elements of the
partition matrix in the following way:

$$\frac{\partial V}{\partial u_{st}} = m u_{st}^{m-1} d_{st}^{2} + \lambda \tag{22.19}$$


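Setting this derivative to zero, we calculate $u_{st}$:

$$u_{st} = \left( -\frac{\lambda}{m} \right)^{1/(m-1)} \frac{1}{d_{st}^{2/(m-1)}} \tag{22.20}$$

Taking into account the normalization condition (22.15b), $\sum_{j=1}^{c} u_{jt} = 1$, and plugging (22.20) into it, one has

$$\left( -\frac{\lambda}{m} \right)^{1/(m-1)} \sum_{j=1}^{c} \frac{1}{d_{jt}^{2/(m-1)}} = 1 \tag{22.21}$$

We compute

$$\left( -\frac{\lambda}{m} \right)^{1/(m-1)} = \frac{1}{\sum_{j=1}^{c} d_{jt}^{-2/(m-1)}} \tag{22.22}$$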


Inserting this expression into (22.20), we obtain the successive entries of the partition matrix:

$$u_{st} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{st}^{2}}{d_{jt}^{2}} \right)^{1/(m-1)}} \tag{22.23}$$
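As a quick numerical check of the membership formula (22.23), consider a worked instance. The values below (c = 2 clusters, m = 2, and squared distances of 1 and 4 from a data point to the two prototypes) are assumed purely for illustration.

```latex
% Assumed values: c = 2, m = 2, d_{1t}^2 = 1, d_{2t}^2 = 4, so 1/(m-1) = 1.
\begin{align*}
u_{1t} &= \frac{1}{\left(\frac{1}{1}\right)^{1} + \left(\frac{1}{4}\right)^{1}}
        = \frac{1}{1.25} = 0.8, \\
u_{2t} &= \frac{1}{\left(\frac{4}{1}\right)^{1} + \left(\frac{4}{4}\right)^{1}}
        = \frac{1}{5} = 0.2, \\
u_{1t} + u_{2t} &= 1 .
\end{align*}
```

The closer prototype receives the higher grade, and the two grades sum to 1 as the constraint (22.15b) requires.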


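The optimization of the prototypes $v_i$ is carried out assuming the Euclidean distance between the data
and the prototypes, that is, $\|x_k - v_i\|^2 = \sum_{j=1}^{n} (x_{kj} - v_{ij})^2$. The objective function now reads as
$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \sum_{j=1}^{n} (x_{kj} - v_{ij})^2$, and its gradient with respect to $v_s$, $\nabla_{v_s} Q$, made equal to zero yields
the system of linear equations:

$$\sum_{k=1}^{N} u_{sk}^{m} (x_{kt} - v_{st}) = 0 \tag{22.24}$$

$s = 1, 2, \ldots, c$; $t = 1, 2, \ldots, n$.
Thus,

$$v_{st} = \frac{\sum_{k=1}^{N} u_{sk}^{m} x_{kt}}{\sum_{k=1}^{N} u_{sk}^{m}} \tag{22.25}$$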


Overall, the FCM clustering is completed through a sequence of iterations where we start from some 
random allocation of data (a certain randomly initialized partition matrix) and carry out the follow- 
ing updates by adjusting the values of the partition matrix and the prototypes. The iterative process is 
continued until a certain termination criterion has been satisfied. Typically, the termination condition 
is quantified by looking at the changes in the membership values of the successive partition matrices. 
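The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `fcm`, the default threshold `eps`, the iteration cap `max_iter`, the fixed random seed, and the handling of a data point that coincides with a prototype are all assumptions made here. The updates follow the prototype formula (22.25) and the membership formula (22.23) with the Euclidean distance, and the loop stops when the largest change in any membership grade drops below `eps`.

```python
import random


def fcm(data, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy C-Means with the Euclidean distance; returns (U, prototypes)."""
    rng = random.Random(seed)
    n_pts, dim = len(data), len(data[0])
    # Random partition matrix U (c x N); each column is normalized to sum to 1.
    U = [[rng.random() + 1e-9 for _ in range(n_pts)] for _ in range(c)]
    for k in range(n_pts):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    V = []
    for _ in range(max_iter):
        # Prototype update: weighted means with weights u_ik^m.
        V = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n_pts)]
            total = sum(w)
            V.append([sum(w[k] * data[k][j] for k in range(n_pts)) / total
                      for j in range(dim)])
        # Partition update from the squared distances to the prototypes.
        U_new = [[0.0] * n_pts for _ in range(c)]
        for k in range(n_pts):
            d2 = [sum((data[k][j] - V[i][j]) ** 2 for j in range(dim))
                  for i in range(c)]
            zeros = [i for i in range(c) if d2[i] == 0.0]
            if zeros:
                # Assumed convention: a point sitting exactly on one or more
                # prototypes shares its full membership among them.
                for i in zeros:
                    U_new[i][k] = 1.0 / len(zeros)
                continue
            for i in range(c):
                U_new[i][k] = 1.0 / sum(
                    (d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c))
        # Largest change of any membership grade between two iterations.
        change = max(abs(U_new[i][k] - U[i][k])
                     for i in range(c) for k in range(n_pts))
        U = U_new
        if change < eps:
            break
    return U, V


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
U, V = fcm(points, c=2)
print([[round(v, 2) for v in proto] for proto in V])
```

On well-separated data such as the six points above, the two prototypes settle near the centers of the two groups; which cluster index ends up with which group depends on the random initialization.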
TABLE 22.1 Main Features of the FCM Clustering Algorithm

Feature of the FCM Algorithm | Representation and Optimization Aspects
Number of clusters (c) | Structure in the data set and the number of fuzzy sets estimated by the method; an increase in the number of clusters produces lower values of the objective function; however, given the semantics of fuzzy sets, one should keep this number quite low (5-9 information granules)
Objective function Q | Develops the structure aimed at the minimization of Q; the iterative process supports the determination of a local minimum of Q
Distance function $\|\cdot\|$ | Reflects (or imposes) a geometry of the clusters one is looking for; essential design parameter affecting the shape of the membership functions
Fuzzification coefficient (m) | Implies a certain shape of membership functions present in the partition matrix; essential design parameter. Low values of “m” (close to 1.0) induce membership functions resembling characteristic functions. Values higher than 2.0 yield spiky membership functions
Termination criterion | Distance between partition matrices in two successive iterations; the algorithm terminates once the distance falls below some assumed positive threshold ($\varepsilon$), that is, $\|U(\text{iter} + 1) - U(\text{iter})\| < \varepsilon$


Denote by U(t) and U(t + 1) the two partition matrices produced in two consecutive iterations of
the algorithm. If the distance $\|U(t+1) - U(t)\|$ is less than a small predefined threshold $\varepsilon$ (say, a small
negative power of 10), then we terminate the algorithm. Typically, one considers the Tchebyschev distance between
the partition matrices, meaning that the termination criterion reads as follows:

$$\max_{i,k} |u_{ik}(t+1) - u_{ik}(t)| < \varepsilon \tag{22.26}$$


The key components of the FCM and a quantification of their impact on the form of the produced results 
are summarized in Table 22.1. 

With regard to the computing supported by the FCM algorithm, an essential point should be made. 
The calculations of the prototypes as given in (22.25) are feasible considering the nature of the distance. 
The use of the Euclidean distance invokes (22.24) while any other distance (which could be potentially 
quite appealing given the geometry of the induced regions in the feature space) does not lead to this 
closed-type of optimization scheme. 

The fuzzification coefficient exhibits a direct impact on the geometry of fuzzy sets generated by the 
algorithm. Typically, the value of “m” is assumed to be equal to 2.0. Lower values of m (that are closer to 1) 
yield membership functions that start resembling characteristic functions of sets; most of the member- 
ship values become localized around 1 or 0. The increase of the fuzzification coefficient (m = 3, 4, etc.) 
produces “spiky” membership functions with the membership grades equal to 1 at the prototypes and 
a fast decline of the values when moving away from the prototypes. Furthermore the average values of 
the membership function are equal to 1/c. Several illustrative examples of the membership functions are 
included in Figure 22.10. In addition to the varying shape of the membership functions, observe that the 
requirement put on the sum of membership grades imposed on the fuzzy sets yields some rippling effect: 
the membership functions are not unimodal but may exhibit some ripples whose intensity depends 
upon the distribution of the prototypes and the values of the fuzzification coefficient. 
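The effect of the fuzzification coefficient can be examined numerically. The sketch below evaluates the FCM membership formula for a single scalar input and three fixed one-dimensional prototypes (1, 3.5, and 5, the values quoted for Figure 22.10); the function name `memberships` and the test point x = 2 are illustrative assumptions.

```python
def memberships(x, prototypes, m):
    """Membership grades of scalar x in clusters with the given 1-D prototypes."""
    d2 = [(x - v) ** 2 for v in prototypes]
    if 0.0 in d2:
        # x coincides with a prototype: full membership there, zero elsewhere.
        hits = d2.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in d2]
    p = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** p for dj in d2) for di in d2]


prototypes = [1.0, 3.5, 5.0]
for m in (1.2, 2.0, 3.5):
    grades = memberships(2.0, prototypes, m)
    print(m, [round(u, 3) for u in grades])
```

For the point x = 2, which lies closest to the first prototype, the grade in the first cluster is close to 1 for m = 1.2 and drops toward 1/c as m grows, which matches the set-like versus spiky behavior described in the text.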

The membership functions offer an interesting feature of evaluating the extent to which a certain data
point is shared between different clusters and in this sense becomes difficult to allocate to a single cluster
(fuzzy set). Let us introduce the following index, which serves as a suitable separation measure between
the clusters:

$$\varphi(u_1, u_2, \ldots, u_c) = 1 - c^{c} \prod_{i=1}^{c} u_i \tag{22.27}$$
FIGURE 22.10 Examples of membership functions of fuzzy sets; the prototypes are equal to 1, 3.5, and 5 while the
fuzzification coefficient assumes values of (a) 1.2, (b) 2.0, and (c) 3.5. The intensity of the rippling effect is affected
by the values of “m” and increases with the higher values of “m.”


where $u_1, u_2, \ldots, u_c$ are the membership degrees for some data point. If only one of the membership degrees,
say $u_i$, equals 1 and the remaining ones are equal to zero, then the separation index attains its maximum equal to 1.
On the other extreme, when the data point is shared by all clusters to the same degree equal to 1/c, then
the value of the index drops down to zero. This means that there is no separation between the clusters
as reported for this specific point.
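The two extreme cases can be verified directly. The sketch below computes the index as 1 minus c^c times the product of the membership degrees, which is the form that returns 0 for equal grades of 1/c; the function name `separation_index` is an illustrative choice.

```python
def separation_index(u):
    """Separation index 1 - c^c * prod(u_i) for one data point's grades u."""
    c = len(u)
    prod = 1.0
    for ui in u:
        prod *= ui
    return 1.0 - (c ** c) * prod


print(separation_index([1.0, 0.0, 0.0]))   # point fully allocated to one cluster
print(separation_index([1 / 3, 1 / 3, 1 / 3]))  # point shared equally
print(separation_index([0.55, 0.45]))      # borderline point between two clusters
```

A borderline point such as the one with grades 0.55 and 0.45 yields an index close to zero, flagging it as hard to allocate to a single cluster.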


22.5.2 Knowledge-Based Clustering 


As is well-known, clustering and supervised pattern recognition (classification) are the two opposite 
poles of the learning paradigm. In reality, there is no “pure” unsupervised learning. There is no fully 
supervised learning as some labels might not be completely reliable (as those encountered in case of 
learning with probabilistic teacher). 

There is some domain knowledge and it has to be carefully incorporated into the generic clustering pro- 
cedure. Knowledge hints can be conveniently captured and formalized in terms of fuzzy sets. Altogether 
with the underlying clustering algorithms, they give rise to the concept of knowledge-based clustering— 
a unified framework in which data and knowledge are processed together in a uniform fashion. 

We discuss some of the typical design scenarios of knowledge-based clustering and show how the 
domain knowledge can be effectively incorporated into the fabric of the original data-driven only clus- 
tering techniques. 
We can distinguish several interesting and practically viable ways in which domain knowledge is
taken into consideration: 


A subset of labeled patterns. The knowledge hints are provided in the form of a small subset of labeled
patterns $K \subset N$ [PW97a,PW97b]. For each of them we have a vector of membership grades $f_k$, $k \in K$,
which consists of the degrees of membership with which the pattern is assigned to the corresponding clusters.
As usual, we have $f_{ik} \in [0, 1]$ and $\sum_{i=1}^{c} f_{ik} = 1$.


Proximity-based clustering. Here we are provided with a collection of pairs of patterns [LPS03] with specified
levels of closeness (resemblance), which are quantified in terms of proximity, prox(k, l), expressed
for $x_k$ and $x_l$. The proximity offers a very general quantification scheme of resemblance: we require
reflexivity and symmetry, that is, prox(k, k) = 1 and prox(k, l) = prox(l, k); however, no transitivity is
needed.

“Belong” and “not-belong” Boolean relationships between patterns. These two Boolean relationships
stress that two patterns should belong to the same cluster, $R(x_k, x_l) = 1$, or that they should be placed apart
in two different clusters, $R(x_k, x_l) = 0$. These two requirements could be relaxed by requiring that these
two relationships return values close to one or zero.


Uncertainty of labeling/allocation of patterns. We may consider that some patterns are “easy” to assign
to clusters while some others are inherently difficult to deal with, meaning that their cluster allocation
is associated with a significant level of uncertainty. Let $\Phi(x_k)$ stand for the uncertainty measure (e.g.,
entropy) for $x_k$ (as a matter of fact, $\Phi$ is computed for the membership degrees of $x_k$, that is, $\Phi(u_k)$, with $u_k$
being the kth column of the partition matrix). The uncertainty hint is quantified by values close to 0 or 1
depending upon what uncertainty level a given pattern is coming from.
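The text names entropy as one example of an uncertainty measure without fixing its form. The sketch below shows one possible choice, assumed here for illustration only: the Shannon entropy of a column of membership grades, divided by log(c) so that the result lies between 0 (pattern fully allocated to one cluster) and 1 (pattern shared equally by all clusters). The function name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(u):
    """Shannon entropy of membership grades u, scaled to the range [0, 1].

    Assumes len(u) >= 2 and that the grades sum to 1.
    """
    c = len(u)
    # Terms with u_i = 0 contribute nothing (the limit of u*log(u) is 0).
    h = -sum(ui * math.log(ui) for ui in u if ui > 0.0)
    return h / math.log(c)


print(round(normalized_entropy([0.55, 0.45]), 3))
print(round(normalized_entropy([0.95, 0.05]), 3))
```

Under this choice, a pattern with grades 0.55 and 0.45 scores much higher than one with grades 0.95 and 0.05, so the measure separates the “difficult” patterns from the “easy” ones.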

Depending on the character of the knowledge hints, the original clustering algorithm needs to be 
properly refined. In particular the underlying objective function has to be augmented to capture the 
knowledge-based requirements. Shown below are several examples of the extended objective functions 
dealing with the knowledge hints introduced above. 

When dealing with some labeled patterns, we consider the following augmented objective function:

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \|x_k - v_i\|^2 + \alpha \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ik} - f_{ik} b_k)^{2} \|x_k - v_i\|^2 \tag{22.28}$$

where the second term quantifies distances between the class membership of the labeled patterns and
the values of the partition matrix. The positive weight factor ($\alpha$) helps set up a suitable balance between
the knowledge about classes already available and the structure revealed by the clustering algorithm. The
Boolean variable $b_k$ assumes values equal to 1 when the corresponding pattern has been labeled.
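To make the role of the two terms concrete, the following sketch evaluates an augmented objective of this kind for given prototypes and memberships. It assumes the reading in which the first term uses the grades raised to the power m and the second term uses the squared difference between each grade and the label grade multiplied by the Boolean flag; the function name `augmented_objective` and the data layout (lists of rows) are illustrative.

```python
def augmented_objective(data, V, U, F, b, alpha, m=2.0):
    """Objective with a supervision term for partially labeled patterns.

    data: N points; V: c prototypes; U, F: c x N matrices of grades;
    b: N flags (1 if the pattern is labeled, 0 otherwise).
    """
    q = 0.0
    for i, v in enumerate(V):
        for k, x in enumerate(data):
            d2 = sum((xj - vj) ** 2 for xj, vj in zip(x, v))
            q += (U[i][k] ** m) * d2                           # data-driven term
            q += alpha * (U[i][k] - F[i][k] * b[k]) ** 2 * d2  # supervision term
    return q


data = [[0.0], [2.0]]
V = [[0.0], [2.0]]
U = [[0.5, 0.5], [0.5, 0.5]]
F = [[1.0, 0.0], [0.0, 0.0]]   # only the first pattern carries a label
b = [1, 0]
print(augmented_objective(data, V, U, F, b, alpha=0.0))
print(augmented_objective(data, V, U, F, b, alpha=1.0))
```

With the weight set to zero the value reduces to the standard FCM objective; a positive weight adds a penalty that grows with the mismatch between the partition matrix and the supplied labels.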

The proximity constraints are accommodated as a part of the optimization problem where we minimize
the distances between the proximity values being provided and those generated by the partition
matrix, $P(k_1, k_2)$:

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \|x_k - v_i\|^2$$

$$\|\text{prox}(k_1, k_2) - P(k_1, k_2)\| \to \text{Min}, \quad (k_1, k_2) \in K \tag{22.29}$$

with K being the collection of pairs of patterns for which the proximity level has been provided. It can be shown that,
given the partition matrix, the expression $\sum_{i=1}^{c} \min(u_{ik_1}, u_{ik_2})$ generates the corresponding proximity
value.
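The proximity induced by a partition matrix is straightforward to compute: for two patterns, take the smaller of their two grades in each cluster and sum over the clusters. A minimal sketch follows; the function name `induced_proximity` and the sample matrix are illustrative.

```python
def induced_proximity(U, k1, k2):
    """Proximity of patterns k1 and k2 induced by the partition matrix U."""
    return sum(min(row[k1], row[k2]) for row in U)


U = [[0.7, 0.2, 0.7],
     [0.3, 0.8, 0.3]]
print(induced_proximity(U, 0, 2))  # identical columns of grades
print(induced_proximity(U, 0, 1))  # columns that disagree
```

Because each column of the partition matrix sums to 1, the induced value is reflexive (a pattern compared with itself gives 1, up to rounding) and symmetric, which are the two properties required of a proximity.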
For the uncertainty constraints, the minimization problem can be expressed as follows:

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \|x_k - v_i\|^2$$

$$\|\Phi(u_k) - \gamma_k\| \to \text{Min}, \quad k \in K \tag{22.30}$$

where K stands for the set of patterns for which we are provided with the uncertainty values $\gamma_k$.
Undoubtedly the extended objective functions call for the optimization scheme that is more demand- 
ing as far as the calculations are concerned. In several cases we cannot modify the standard technique 
of Lagrange multipliers which leads to an iterative scheme of successive updates of the partition matrix 
and the prototypes. In general, though, the knowledge hints give rise to a more complex objective func- 
tion in which the iterative scheme cannot be useful in the determination of the partition matrix and 
the prototypes. Alluding to the generic FCM scheme, we observe that the calculations of the prototypes 
in the iterative loop are doable in case of the Euclidean distance. Even the Hamming or Tchebyshev 
distance brings a great deal of complexity. Likewise, the knowledge hints lead to increased complexity:
the prototypes cannot be computed in a straightforward way and one has to resort to more
advanced optimization techniques. Evolutionary computing arises here as an appealing alternative. We
may consider any of the options available there including genetic algorithms, particle swarm optimiza- 
tion, ant colonies, to name some of them. The general scheme can be schematically structured as follows: 


* repeat {EC(prototypes); compute partition matrix U;}
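One way to read this scheme is as a loop in which an evolutionary step proposes new prototypes and the partition matrix is then recomputed for them. The sketch below is a deliberately simple stand-in for a full evolutionary algorithm: a single candidate is mutated with Gaussian noise and kept only if it lowers the objective. The city-block (Hamming) distance, the mutation scale `sigma`, the number of generations, the small floor on the squared distance, and all function names are assumptions made for illustration.

```python
import random


def city_block(x, v):
    """City-block (Hamming) distance between a data point and a prototype."""
    return sum(abs(a - b) for a, b in zip(x, v))


def partition_and_q(data, V, m=2.0):
    """Compute the partition matrix U for fixed prototypes V and the objective Q."""
    c = len(V)
    U = [[0.0] * len(data) for _ in range(c)]
    q = 0.0
    for k, x in enumerate(data):
        # A small floor avoids division by zero when x sits on a prototype.
        d2 = [max(city_block(x, v) ** 2, 1e-12) for v in V]
        for i in range(c):
            U[i][k] = 1.0 / sum(
                (d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(c))
            q += U[i][k] ** m * d2[i]
    return U, q


def evolve_prototypes(data, c, generations=200, sigma=0.5, seed=0):
    """repeat {EC(prototypes); compute partition matrix U} with a mutate-and-keep step."""
    rng = random.Random(seed)
    best_V = [list(rng.choice(data)) for _ in range(c)]
    best_U, best_q = partition_and_q(data, best_V)
    history = [best_q]
    for _ in range(generations):
        # EC step: perturb every prototype coordinate with Gaussian noise.
        cand = [[vj + rng.gauss(0.0, sigma) for vj in v] for v in best_V]
        U, q = partition_and_q(data, cand)
        if q < best_q:  # keep the candidate only if it improves Q
            best_V, best_U, best_q = cand, U, q
        history.append(best_q)
    return best_V, best_U, history


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
V, U, history = evolve_prototypes(points, c=2)
print(round(history[0], 3), round(history[-1], 3))
```

Because a candidate is accepted only when it improves the objective, the recorded values never increase. A population-based method such as a genetic algorithm or particle swarm optimization would replace the single mutation step while keeping the same outer structure.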


22.6 Data and Dimensionality Reduction 


The problem of dimensionality reduction [DHS01] and complexity management in pattern recognition 
is by no means a new endeavor. This has led to a number of techniques which as of now are regarded
as classic and are used quite intensively. There have been a number of approaches deeply rooted in classic
statistical analysis. The ideas of principal component analysis, Fisher analysis, and the like are
techniques of paramount relevance. What has changed quite profoundly over the decades is the magnitude
of the problem itself which has forced us to the exploration of new ideas and optimization techniques 
involving advanced techniques of global search including tabu search and biologically inspired 
optimization mechanisms. 

In a nutshell, we can distinguish between two fundamental reduction processes involving (a) data 
and (b) features (attributes). Data reduction is concerned with grouping patterns and revealing their 
structure in the form of clusters (groups). Clustering is regarded as one of the fundamental techniques 
within the domain of data reduction. Typically, we start with thousands of data points and arrive at
10-15 clusters. The nature of the clusters could vary depending upon the underlying formalisms. While 
in most cases, the representatives of the clusters are numeric entities such as prototypes or medoids, we 
can encounter granular constructs such as, e.g., hyperboxes. 

Feature or attribute reduction [F90,M03,UT07,DW M072] deals with (a) transformation of the feature 
space into another feature space of a far lower dimensionality or (b) selection of a subset of features that 
are regarded to be the most essential (dominant) with respect to a certain predefined objective function. 
Considering the underlying techniques of feature transformation, we encounter a number of classic 
linear statistical techniques such as, e.g., principal component analysis or more advanced nonlinear 
mapping mechanisms realized by, e.g., neural networks. 

The criteria used to assess the quality of the resulting (reduced) feature space give rise to the two
general categories, namely filters and wrappers. Using filters we consider some criterion that pertains 
to the statistical internal characteristics of the selected attributes and evaluate them with this respect. 
[Figure 22.11 diagram labels: data and feature reduction; data reduction; feature reduction; feature transformation; feature selection; filters; wrappers; fuzzy sets (clusters); fuzzy sets (features).]

FIGURE 22.11 Categories of reduction problems in pattern recognition: feature and data reduction, use of filters,
and wrappers criteria.


In contrast, when dealing with wrappers, we are concerned with the effectiveness of the features as a 
vehicle to carry out classification so in essence there is a mechanism (e.g., a certain classifier) which 
effectively evaluates the performance of the selected features with respect to their discriminatory (that 
is external) capabilities. In addition to feature and data reduction being regarded as two separate pro- 
cesses, we may consider their combinations in which both features and data are reduced. The general 
view of the reduction mechanisms is presented in Figure 22.11. Fuzzy sets augment the principles of
the reduction processes. In the case of data, fuzzy clusters reveal a structure in the data and quantify the
assignment through membership grades. The same clustering mechanism could be applied to features
to form collections of features which could be treated en bloc. The concept of biclustering engages the
reduction processes realized together and leads to a collection of fuzzy sets described in the Cartesian 
product of data and features. In this sense, with each cluster one associates a collection of the features 
that describe it to the highest extent and make it different from the other clusters. 


22.7 Conclusions 


Fuzzy pattern recognition comes as a coherent and diversified setting of pattern recognition. The 
human-centricity is one of its dominant aspects, which is well supported by the essence of fuzzy sets. 
They are effectively used in the realization of more focused and specialized feature space being quite 
often of reduced dimensionality in comparison with the original one. The ability to quantify class mem- 
bership by departing from the rigid 0-1 quantification and allowing for in-between membership grades 
is another important facet of pattern recognition, facilitating interaction with humans (designers and 
users of the classifiers). The logic fabric of the classifiers is yet another advantage of fuzzy classifiers 
enhancing their interpretability. 

It is worth stressing that there are numerous ways in which the algorithmic aspects of fuzzy sets can 
be incorporated into fuzzy pattern recognition. The existing research, albeit quite diversified, comes 
with a number of conceptual avenues to investigate more thoroughly. This may involve (a) the use of 
other constructs of Granular Computing in a unified fashion, (b) synergy between fuzzy classification 
and probabilistic/statistical pattern classifiers, and (c) exploration of hierarchies of fuzzy pattern recog- 
nition constructs. 
23.1 Introduction


Mathematical models are indispensable when we wish to rigorously analyze dynamic systems. Such 
a model summarizes and interprets the empirical data. It can also be used to simulate the system on 
a computer and to provide predictions for future behavior. Mathematical models of the atmosphere, 
which can be used to provide weather predictions, are a classic example. 

In physics, and especially in classical mechanics, it is sometimes possible to derive mathematical 
models using first principles such as the Euler-Lagrange equations [1]. In other fields of science, like biol- 
ogy, economics, and psychology, no such first principles are known. In many cases, however, researchers 
have provided descriptions and explanations of various phenomena stated in natural language. Science 
can greatly benefit from transforming these verbal descriptions into mathematical models. This raises 
the following problem. 


* This chapter is based on the paper “The fuzzy ant,” by V. Rozin and M. Margaliot which appeared in the IEEE 
Computational Intelligence Magazine, 2, 18-28, 2007. © IEEE. 
Problem 23.1 Find an efficient way to transform verbal descriptions into a mathematical model or computer
algorithm. 


This problem has already been addressed in the field of artificial expert systems (AESs) [2]. These are 
computer algorithms that emulate the functioning of a human expert, for example, a physician who can 
diagnose diseases or an operator who can successfully control a specific system. One approach to con- 
structing AESs is based on questioning the human expert in order to extract information on his/her func- 
tioning. This leads to a verbal description, which must then be transformed into a computer algorithm. 

Fuzzy logic theory has been associated with human linguistics ever since it was first suggested by 
Zadeh [3,4]. In particular, fuzzy modeling (FM) is routinely used to transform the knowledge of a human 
expert, stated in natural language, into a fuzzy expert system that imitates the expert’s functioning [5,6]. 
The knowledge extracted from the human expert is stated as a collection of If-Then rules expressed 
using natural language. Defining the verbal terms in the rules using suitable membership functions, and 
inferring the rule base, yields a well-defined mathematical model. Thus, the verbal information is trans- 
formed into a form that can be programmed on a computer. This approach has been used to develop 
AESs that diagnose diseases, control various processes, and much more [2,6,7]. 

The overwhelming success of fuzzy expert systems suggests that FM may be a suitable approach for 
solving Problem 23.1. Indeed, decades of successful applications (see, e.g., [8-13]) suggest that the real 
power of fuzzy logic lies in its ability to handle and manipulate verbally-stated information based on 
perceptions rather than equations [14-18]. 

Recently, FM was applied in a different context, namely, in transforming verbal descriptions and 
explanations of natural phenomena into a mathematical model. The goal here is not to replace a human 
expert with a computer algorithm, but rather to assist a researcher in transforming his/her understand- 
ing of the phenomena, stated in words, into a well-defined mathematical model. 

The applicability and usefulness of this approach was demonstrated using examples from the field of 
ethology. FM was applied to transform verbal descriptions of various animal behaviors into mathemati- 
cal models. Examples include the following: (1) territorial behavior in the stickleback [19], as described 
by Nobel Laureate Konrad Lorenz in [20]; (2) the mechanisms governing the orientation to light in 
the planarian Dendrocoleum lacteum [21]; (3) flocking behavior in birds [22]; (4) the self-regulation of 
population size in blow-flies [23]; and (5) the switching behavior of an epigenetic switch in the lambda 
virus [24]. 

There are several reasons that FM seems particularly suitable for modeling animal behavior. First, many 
animal (and human) actions are “fuzzy.” For example, the response to a (low intensity) stimulus might 
be what Heinroth called intention movements, that is, a slight indication of what the animal is tending 
to do. Tinbergen [25, Ch. IV] states: “As a rule, no sharp distinction is possible between intention move- 
ments and more complete responses; they form a continuum.” In this respect, it is interesting to recall 
that Zadeh [3] defined a fuzzy set as “a class of objects with a continuum of grades of membership.” Hence, 
FM seems an appropriate tool for studying such behaviors. The second reason is that studies of animal 
behavior often provide a verbal description of both field observations and interpretations. For example, 
Fraenkel and Gunn describe the behavior of a cockroach that becomes stationary when a large part of its 
body surface is in contact with a solid object as “A high degree of contact causes low activity ...” [26, p. 23]. 
Note that this can be immediately stated as the fuzzy rule: If degree of contact is high, then activity is low. 
In fact, it is customary to describe the behavior of simple organisms using simple rules of thumb [27]. 


23.1.1 Fuzzy Modeling and Biomimicry 


Considerable research is currently devoted to the field of biomimicry—the development of artificial 
products or machines that mimic biological phenomena [28-31]. Over the course of evolution, living 
systems have developed efficient and robust solutions to various problems. Some of these problems are 
also encountered in engineering applications. For example, plants had to develop efficient mechanisms 
for absorbing and utilizing solar energy. Engineers who design solar cells face a similar challenge. More
generally, many natural beings have developed the capabilities to reason, learn, evolve, adapt, and heal. 
Scientists and engineers are interested in implementing such capabilities in artificial systems. 

An important component in the design of artificial systems based on natural phenomena is the ability 
to perform reverse engineering of the functioning of the natural phenomena. We believe that FM may 
be suitable for addressing this issue in a systematic manner [32]. Namely, start with a verbal descrip- 
tion of the biological system’s behavior (e.g., foraging in ants) and, using fuzzy logic theory, obtain a 
mathematical model of this behavior that can be immediately implemented by artificial systems (e.g., 
autonomous robots). 

In this chapter, we describe the application of FM to develop a mathematical model for the foraging 
behavior of ants. The resulting model is simpler, more plausible, and more amenable to analysis than 
previously suggested models. Simulations and rigorous analysis of the resulting model show that it is 
congruent with the behavior actually observed in nature. Furthermore, the new model establishes an 
interesting link between the averaged behavior of a colony of foraging ants and mathematical models 
used in the theory of artificial neural networks (ANNs) (see Section 23.8). 

The next section provides a highly simplified, yet hopefully intuitive, introduction to fuzzy model- 
ing. Section 23.3 reviews the foraging behavior of social ants. Section 23.4 applies FM to transform the 
verbal description into a simple mathematical model describing the behavior of a single ant. In Section 
23.5, this is used to develop a stochastic model for the behavior of a colony of identical ants. Section 23.6 
reviews an averaged model of the colony. Sections 23.7 and 23.8 are devoted to studying this averaged 
model using simulations and rigorous analysis, respectively. The final section concludes and describes 
some possible directions for further research. 


23.2 Fuzzy Modeling: A Simple Example 


We begin by presenting the rudiments of FM using a very simple example. More information on FM can 
be found in many textbooks (e.g., [11,33]). Readers familiar with FM may skip to Section 23.3. 
Consider the scalar control system:

$$\dot{x}(t) = u(t),$$

where
$x: \mathbb{R} \to \mathbb{R}$ is the state of the system
$u: \mathbb{R} \to \mathbb{R}$ is the control

Suppose that our goal is to design a state-feedback control (i.e., a control in the form $u(t) = u(x(t))$)
guaranteeing that $\lim_{t \to \infty} x(t) = 0$ for any initial condition $x(0)$. It is clear that in order to achieve this, the
control must be negative (positive) when $x(t)$ is positive (negative). This suggests the following two rules:

Rule 1: If (x is positive), then u = −c,
Rule 2: If (x is negative), then u = c,

where c > 0. 

FM provides an efficient mechanism for transforming such rules into a well-defined mathematical
formula: u = u(x). The first step is to define the terms in the If part of the rules. To do this, we use two
functions: $\mu_{\text{positive}}(x)$ and $\mu_{\text{negative}}(x)$. Roughly speaking, for a given x, $\mu_{\text{positive}}(x)$ measures how true the
proposition (x is positive) is. For example, we may take

$$\mu_{\text{positive}}(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{if } x \le 0. \end{cases}$$
FIGURE 23.1 Membership functions $\mu_{\text{positive}}(x)$ and $\mu_{\text{negative}}(x)$.


However, using such a binary, 0/1, function will lead to a control that changes abruptly as x changes sign. 
It may thus be better to use a smoother function, say 


$$\mu_{\mathrm{positive}}(x) := (1 + \exp(-x))^{-1}.$$
This is a continuous function taking values in the interval [0,1] and satisfying (see Figure 23.1)


$$\lim_{x \to -\infty} \mu_{\mathrm{positive}}(x) = 0, \qquad \lim_{x \to \infty} \mu_{\mathrm{positive}}(x) = 1.$$


We may also view $\mu_{\mathrm{positive}}(x)$ as providing the degree of membership of x in the set of positive numbers.
A smoother membership function seems more appropriate for sets that are defined using verbal terms 
[34]. To demonstrate this, consider the membership in the set of tall people. A small change in a person’s 
height should not lead to an abrupt change in the degree of membership in this set. 

The second membership function is defined by $\mu_{\mathrm{negative}}(x) := 1 - (1 + \exp(-x))^{-1}$. Note that this implies that


$$\mu_{\mathrm{positive}}(x) + \mu_{\mathrm{negative}}(x) = 1, \quad \text{for all } x \in \mathbb{R},$$


i.e., the total degree of membership in the two sets is always one. 

Once the membership functions are specified, we can define the degree of firing (DOF) of each rule, for a
given input x, as $\mathrm{DOF}_1(x) = \mu_{\mathrm{positive}}(x)$ and $\mathrm{DOF}_2(x) = \mu_{\mathrm{negative}}(x)$. The output of the first (second) rule in our
fuzzy rule base is then defined by $-c\,\mathrm{DOF}_1(x)$ ($c\,\mathrm{DOF}_2(x)$). In other words, the output of each rule is obtained
by multiplying the DOF with the value in the Then part of the rule. Finally, the output of the entire fuzzy 
rule base is given by suitably combining the outputs of the different rules. This can be done in many ways. 
One standard choice is to use the so-called center of gravity inferencing method yielding: 


$$u(x) = \frac{-c\,\mathrm{DOF}_1(x) + c\,\mathrm{DOF}_2(x)}{\mathrm{DOF}_1(x) + \mathrm{DOF}_2(x)}.$$


The numerator is the sum of the rules’ outputs, and the denominator plays the role of a scaling factor. 
Note that we may also express this as 


$$u(x) = -c\,\frac{\mathrm{DOF}_1(x)}{\mathrm{DOF}_1(x) + \mathrm{DOF}_2(x)} + c\,\frac{\mathrm{DOF}_2(x)}{\mathrm{DOF}_1(x) + \mathrm{DOF}_2(x)},$$


which implies that the output is always a convex combination of the rules’ outputs. 

FIGURE 23.2 The function $u(x) = -c\tanh(x/2)$ for c = 1.

Substituting the membership functions yields the controller:


$$u(x) = -c\,(1 + \exp(-x))^{-1} + c\left(1 - (1 + \exp(-x))^{-1}\right) = -c\tanh\left(\frac{x}{2}\right)$$


(see Figure 23.2). Note that this can be viewed as a smooth version of the controller: 


$$u(x) = \begin{cases} -c, & \text{if } x > 0, \\ c, & \text{if } x < 0. \end{cases}$$


Summarizing, FM allows us to transform verbal information, stated in the form of If-Then rules, into a 
well-defined mathematical function. Note that the fuzziness here stems from the inherent vagueness of 
verbal terms. This vagueness naturally implies that any modeling process based on verbal information 
would include many degrees of freedom [32]. Yet, it is important to note that the final result of the FM 
process is a completely well-defined mathematical formulation. 
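
As a quick numerical check of the example above, the following minimal Python sketch evaluates the two-rule controller by center of gravity inference and compares it with the closed form -c tanh(x/2). The function names (`mu_positive`, `mu_negative`, `fuzzy_controller`) are illustrative choices made here and do not come from the chapter.

```python
import math


def mu_positive(x: float) -> float:
    """Degree to which the proposition 'x is positive' holds (logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))


def mu_negative(x: float) -> float:
    """Degree to which 'x is negative' holds; the two memberships sum to one."""
    return 1.0 - mu_positive(x)


def fuzzy_controller(x: float, c: float = 1.0) -> float:
    """Center of gravity inference for the rules 'u = -c' and 'u = c'."""
    dof1 = mu_positive(x)  # degree of firing of Rule 1
    dof2 = mu_negative(x)  # degree of firing of Rule 2
    return (-c * dof1 + c * dof2) / (dof1 + dof2)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        # The second and third columns should agree up to rounding error.
        print(x, fuzzy_controller(x), -math.tanh(x / 2.0))
```

Note that `math.exp(-x)` overflows for very large negative x (roughly x < -709), so a production implementation would need a numerically safer logistic function.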


23.3 Foraging Behavior of Ants 


A foraging animal may have a variety of potential paths to a food item. Finding the shortest path mini- 
mizes time, effort, and exposure to hazards. For mass foragers, such as ants, it is also important that all 
foragers reach a consensus when faced with a choice of paths. This is not a trivial task, as ants have very 
limited capabilities of processing and sharing information. Furthermore, this consensus is not reached 
by means of an ordered chain of hierarchy. 

Ants and other social insects have developed an efficient technique for solving these problems [35]. 
While walking from a food source to the nest, or vice versa, ants deposit a chemical substance called 
pheromone, thus forming a pheromone trail. Following ants are able to smell this trail. When faced by 
several alternative paths, they tend to choose those that have been marked by pheromones. This leads to 
a positive feedback mechanism: a marked trail will be chosen by more ants that, in turn, deposit more 
pheromone, thus stimulating even more ants to choose the same trail. 

Goss et al. [36] designed an experiment in order to study the behavior of the Argentine ant Iridomyrmex 
humilis while constructing a trail around an obstacle. A laboratory nest was connected to a food source 


© 2011 by Taylor and Francis Group, LLC 


23-6 Intelligent Systems 


FIGURE 23.3 Experimental setup with two branches: the left branch is shorter. 


by a double bridge (see Figure 23.3). Ants leaving the nest or returning from the food item to the nest 
must choose a branch. After making the choice, they mark the chosen branch. Ants that take the shorter 
of the two branches return sooner than those using the long branch. Thus, in a given time unit, the short 
branch receives more markings than the long branch. This small difference in the pheromone concen- 
trations is amplified by the positive feedback process. The process generally continues until nearly all the 
foragers take the same branch, neglecting the other one. In this sense, it appears that the entire colony 
has decided to use the short branch. 

The positive feedback process is counteracted by negative feedback due to pheromone evaporation. 
This plays an important role: the markings of obsolete paths, which lead to depleted food sources, disap- 
pear. This increases the chances of detecting new and more relevant paths. 

Note that in this model, no single ant compares the length of the two branches directly. Furthermore, 
the ants only communicate indirectly by laying pheromones and thus locally modifying their environ- 
ment. This form of communication is known as stigmergy [37]. The net result, however, is that the entire 
colony appears to have made a well-informed choice of using the shorter branch. 

The fact that simple individual behaviors can lead to a complex emergent behavior has been known 
for centuries. King Solomon marveled at the fact that “the locusts have no king, yet go they forth all 
of them by bands” (Proverbs 30:27). More recently, it was noted that this type of emergent collective 
behavior is a desirable property in many artificial systems. From an engineering point of view, the 
solution of a complex problem using simple agents is an appealing idea, which can save considerable 
time and effort. Furthermore, the specific problem of detecting the shortest path is important in many 
applications, including robot navigation and communication engineering (see, e.g., [38-40]). 


23.4 Fuzzy Modeling of Foraging Behavior 


In this section, we apply FM to transform a verbal description of the foraging behavior into a mathemat- 
ical model. The approach consists of the following stages: (1) identification of the variables, (2) stating 
the verbal information as a set of fuzzy rules relating the variables, (3) defining the fuzzy terms using 
suitable membership functions, and (4) inferring the rule-base to obtain a mathematical model [19]. 
When creating a mathematical model from a verbal description there are always numerous degrees of 
freedom. In the FM approach, this is manifested in the freedom in choosing the components of the fuzzy 
model: the type of membership functions, logical operators, inference method, and the values of the differ- 
ent parameters. The following guidelines may be helpful in selecting the different components of the fuzzy 
model (see also [33] for details on how the various elements in the fuzzy model influence its behavior). 

First, it is important that the resulting mathematical model has the simplest possible form, in order to 
be amenable to analysis. Thus, for example, a Takagi-Sugeno model with singleton consequents might 
be more suitable than a model based on Zadeh’s compositional rule of inference [33]. 

Second, when modeling real-world systems, the variables are physical quantities with dimensions 
(e.g., length, time). Dimensional analysis [41,42], the process of introducing dimensionless variables, can 
often simplify the resulting equations and decrease the number of parameters. 

Third, sometimes the verbal description of the system is accompanied by measurements of various 
quantities in the system. In this case, methods such as fuzzy clustering, neural learning, or least squares 
approximation (see, e.g., [43-45] and the references therein) can be used to fine-tune the model using 
the discrepancy between the measurements and the model’s output. 

For the foraging behavior in the simple experiment described above, we need to model the choice- 
making process of an ant facing a fork in a path. We use the following verbal description [46]: “If a mass 
forager arrives at a fork in a chemical recruitment trail, the probability that it takes the left branch is all 
the greater as there is more trail pheromone on it than on the right one.” 

An ant is a relatively simple creature, and any biologically feasible description of its behavior must 
also be simple, as is the description above. Naturally, transforming this description into a set of fuzzy 
rules will lead to a simple rule-base. Nevertheless, we will see below that the resulting fuzzy model, 
although simple, has several unique advantages. 


23.4.1 Identification of the Variables 


The variables in the model are the pheromone concentrations on the left and right branches denoted 
L and R, respectively. The output is P = P(L, R), which is the probability of choosing the left branch. 


23.4.2 Fuzzy Rules 


According to the verbal description given above, the probability P of choosing the left branch at the fork 
is directly correlated with the difference in pheromone concentrations D: = L — R. We state this using 
two fuzzy rules: 


Rule 1: If D is positive Then P = 1. 
Rule 2: If D is negative Then P = 0. 


23.4.3 Fuzzy Terms 


A suitable membership function for the term positive, $\mu_{\mathrm{pos}}(\cdot)$, must satisfy the following constraints:
$\mu_{\mathrm{pos}}(D)$ is a monotonically increasing function, $\lim_{D \to -\infty} \mu_{\mathrm{pos}}(D) = 0$, and $\lim_{D \to \infty} \mu_{\mathrm{pos}}(D) = 1$. There are
good reasons for using the hyperbolic tangent function in both ANNs and fuzzy models [18,47], so we
use the membership function $\mu_{\mathrm{pos}}(D) := (1 + \tanh(qD))/2$. The parameter q > 0 determines the slope of
$\mu_{\mathrm{pos}}(D)$. The term negative is modeled using $\mu_{\mathrm{neg}}(D) = 1 - \mu_{\mathrm{pos}}(D)$.

As we will see below, this choice of membership functions also leads to a mathematical model for the 
behavior of a colony of ants that is more amenable to analysis than previously suggested models. 


23.4.4 Fuzzy Inferencing 


We use center of gravity inference. This yields


$$P(D) = \frac{\mu_{\mathrm{pos}}(D)}{\mu_{\mathrm{pos}}(D) + \mu_{\mathrm{neg}}(D)} = \frac{1 + \tanh(qD)}{2}. \tag{23.1}$$

FIGURE 23.4 The functions: (a) $P(L - R)$ with q = 0.016 and (b) $P_{2,20}(L, R)$.


Note that $P(D) \in (0, 1)$ for all $D \in \mathbb{R}$. For q = 0, P(D) = 1/2 for all D, i.e., the selection between the two
branches becomes completely random. For $q \to \infty$, $P(D) \to 1$ ($P(D) \to 0$) for L > R (L < R). In other words,
as q increases, P(D) becomes more sensitive to the difference between L and R.
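
The choice function (23.1) is short enough to state directly in code. The sketch below is only an illustration: the function name `choice_probability` and the default value of `q` (taken from the caption of Figure 23.4) are assumptions made here.

```python
import math


def choice_probability(left: float, right: float, q: float = 0.016) -> float:
    """Probability of taking the left branch, P(D) = (1 + tanh(q * D)) / 2.

    `left` and `right` are the pheromone concentrations L and R, and D = L - R.
    """
    d = left - right
    return 0.5 * (1.0 + math.tanh(q * d))


if __name__ == "__main__":
    # Equal concentrations give a coin flip; larger q sharpens the choice.
    print(choice_probability(10.0, 10.0))
    print(choice_probability(60.0, 10.0, q=0.016))
    print(choice_probability(60.0, 10.0, q=0.16))
```

Setting `q = 0` reproduces the fully random choice described above, while a large `q` makes the function approach a hard switch on the sign of L - R.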


23.4.5 Parameter Estimation 


Goss et al. [36] and Deneubourg et al. [48] suggested the following probability function: 


$$P_{n,k}(L, R) = \frac{(k + L)^n}{(k + L)^n + (k + R)^n}. \tag{23.2}$$


As noted in [48], the parameter n determines the degree of nonlinearity of $P_{n,k}$. The parameter k corresponds
to the degree of attraction attributed to an unmarked branch: as k increases, a greater marking
is necessary for the choice to become significantly nonrandom.

Note that for $L \gg R$ ($L \ll R$), both (23.1) and (23.2) yield that the probability of choosing the left
branch goes to one (zero). Our model is simpler and seems more plausible, as it depends only on the difference
D = L - R, and it includes only a single parameter.

Deneubourg et al. [48] found that for n = 2 and k = 20, the function (23.2) agrees well with the actual 
behavior observed in an experiment involving Iridomyrmex humilis. Deneubourg et al. do not provide the 
exact biological data they used. In order to obtain (indirectly) a reasonable match with the real biological 
behavior, we tried to match P(D) with the function $P_{2,20}(L, R)$. To do so, we (numerically) solved the problem:


$$\min_{q} \sum_{(L,R) \in A} \left| P(L - R) - P_{2,20}(L, R) \right|,$$


where $A = \{0, 1, \ldots, 100\} \times \{0, 1, \ldots, 100\}$. The best match is obtained for q = 0.016 (see Figure 23.4).
In the next sections, we simulate and rigorously analyze the behavior of a colony of “fuzzy” ants, that 
is, ants that choose between two alternative paths according to the function P(D). 
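
The fitting step described above can be reproduced approximately with a brute-force search. The following sketch is a hedged illustration, not the authors' procedure: the candidate grid for q (0.001 to 0.1 in steps of 0.001), the use of an unsquared absolute error, and the function names are all assumptions made here, so the value it returns need not coincide exactly with the reported q = 0.016.

```python
import math


def p_fuzzy(left: float, right: float, q: float) -> float:
    """Fuzzy choice function (23.1), which depends only on D = L - R."""
    return 0.5 * (1.0 + math.tanh(q * (left - right)))


def p_deneubourg(left: float, right: float, n: float = 2, k: float = 20) -> float:
    """Choice function (23.2) with nonlinearity n and attraction constant k."""
    a = (k + left) ** n
    b = (k + right) ** n
    return a / (a + b)


def fit_q(candidates, n=2, k=20, max_conc=100):
    """Return the candidate q minimizing the summed absolute mismatch over A."""
    grid = range(max_conc + 1)  # A = {0, ..., max_conc} x {0, ..., max_conc}

    def cost(q):
        return sum(
            abs(p_fuzzy(l, r, q) - p_deneubourg(l, r, n, k))
            for l in grid
            for r in grid
        )

    return min(candidates, key=cost)


if __name__ == "__main__":
    candidates = [i / 1000.0 for i in range(1, 101)]  # assumed search grid
    print("best q on the assumed grid:", fit_q(candidates))
```

A finer grid or a one-dimensional optimizer could replace the exhaustive search; the exhaustive version is used here only because it is easy to verify by inspection.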


23.5 Stochastic Model 


We model the scenario depicted in Figure 23.3 as a sequence of stochastic events. Initially, at time 
t = 0, all trails are unmarked: $L_1(0) = R_1(0) = L_2(0) = R_2(0) = 0$, where the subscript 1 [2] refers to the fork near the
nest [food source]. Let $\tau$ denote the time needed to travel
from the nest to the food item using the left branch. The corresponding time for the right branch is
$r\tau$, with $r \ge 1$.

At every time step t = 0, 1, ..., 1000, a new ant heads out of the nest and chooses a branch at the fork
near the nest. The choice is made according to the probability $P(L_1(t), R_1(t))$. If the choice is to follow the
left [right] branch, then $L_1$ [$R_1$] is increased by 1. This ant reaches the fork near the food source at time
$t + \tau$ [$t + r\tau$], adding 1 to $L_2(t + \tau)$ [$R_2(t + r\tau)$], and then chooses which branch to use on its return according
to $P(L_2(t + \tau), R_2(t + \tau))$ [$P(L_2(t + r\tau), R_2(t + r\tau))$]. Consequently, either $L_2$ or $R_2$ is increased, and after
$\tau$ or $r\tau$ time steps, 1 is added to $L_1$ or $R_1$, respectively. The effect of pheromone evaporation, with rate
$s \in (0, 1]$, is modeled by setting $L_i(t + 1) = (1 - s) L_i(t)$, $R_i(t + 1) = (1 - s) R_i(t)$, i = 1, 2, at every time step.

To estimate the traffic at steady state, we numbered the left/right decisions consecutively, and the 
results presented below are based on decisions 501-1000. 
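
One possible reading of the stochastic model above is sketched below in Python. Several details are assumptions made for this illustration and are not specified in the text: the order of events within a time step (arrivals first, then the new ant, then evaporation), the rounding of the travel time for the right branch to an integer number of steps, and all function and variable names.

```python
import math
import random
from collections import defaultdict


def p_left(left, right, q=0.016):
    """Fuzzy choice function (23.1) applied to the difference left - right."""
    return 0.5 * (1.0 + math.tanh(q * (left - right)))


def simulate_colony(tau=20, r=1, s=0.01, q=0.016, steps=1000, seed=None):
    """Run one simulation; return the fraction of left choices among decisions 501-1000."""
    rng = random.Random(seed)
    pher = {"L1": 0.0, "R1": 0.0, "L2": 0.0, "R2": 0.0}  # all trails unmarked
    travel = {"L": tau, "R": round(r * tau)}  # assumed integer travel times
    arrivals = defaultdict(list)  # time step -> [(destination fork, branch used)]
    decisions = []  # True whenever an ant picked the left branch

    def choose(fork):
        go_left = rng.random() < p_left(pher["L" + fork], pher["R" + fork], q)
        branch = "L" if go_left else "R"
        decisions.append(go_left)
        pher[branch + fork] += 1.0  # mark the chosen branch at this fork
        return branch

    for t in range(steps + 1):
        for dest, branch in arrivals.pop(t, []):
            pher[branch + dest] += 1.0  # mark the far end of the branch on arrival
            if dest == "2":  # at the food source: pick a branch for the way back
                back = choose("2")
                arrivals[t + travel[back]].append(("1", back))
        out = choose("1")  # a new ant leaves the nest
        arrivals[t + travel[out]].append(("2", out))
        for key in pher:  # evaporation at rate s
            pher[key] *= 1.0 - s

    window = decisions[500:1000]  # decisions numbered 501-1000
    return sum(window) / len(window)


if __name__ == "__main__":
    runs = [simulate_colony(r=2, seed=i) for i in range(20)]
    print("fraction of left choices per run:", [round(x, 2) for x in runs])
```

Repeating `simulate_colony` many times and binning the returned fractions gives a histogram of the kind described in the following paragraphs; the exact counts depend on the assumptions listed above.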

Figure 23.5 summarizes the results of 1000 simulations with $\tau = 20$, s = 0.01, and r = 1 (equal branches).
Using (23.1), almost all simulations end up with the colony choosing one of the two branches. In 523
simulations, 80%-100% of the ants end up choosing the left branch. In almost all other simulations,
80%-100% of the ants end up choosing the right branch.


FIGURE 23.5 Percentage of ants that chose the left branch, for r = 1 and s = 0.01: (a) using $P(L - R)$ with q = 0.016
and (b) using $P_{2,20}(L, R)$.

These results seem to agree with the behavior actually observed in experiments using a double bridge 
with equal length branches: “Should more ants use one of the branches at the beginning of the experi- 
ment, either by chance or for some other reason, then that branch will be most strongly marked and 
attract more ants, and so on until most of the ants use that branch” [46, p. 403]. 

Similar behavior is seen when using the probability function (23.2), but the distribution is more “blurry,” 
as there are considerably more simulations ending with no clear-cut choice of one of the branches. 

Figure 23.6 summarizes the results of 1000 simulations with $\tau = 20$, s = 0.01, and r = 2, that is, the time
needed to follow the right branch is twice as long as that of the left branch. It may be seen that using the
probability function (23.1) leads to a clear-cut distribution: in 807 simulations, 80%-100% of the ants
choose the shorter branch. In 186 simulations, 0%-20% of the ants choose the shorter branch. Thus, in
993 of the 1000 simulations the colony converges to a favorable branch, and in 81% of the simulations 
this is indeed the shorter branch. 


FIGURE 23.6 Percentage of ants that chose the left branch, for r = 2 and s = 0.01: (a) using $P(L - R)$ with q = 0.016
and (b) using $P_{2,20}(L, R)$.

These results agree with the behavior actually observed in nature: “The experiments show that L. niger 
colonies nearly always select the shorter of two branches, and do so with a large majority of foragers” 
[46, p. 413]. 

Using the probability function (23.2) leads again to a more “blurry” distribution, where in more 
simulations there is no clear convergence toward a favorable branch. 


23.6 Averaged Model 


Following [36], we now consider a deterministic model that describes the “average” concentration of pher- 
omones in the system. This averaged model is a set of four nonlinear delay differential equations (DDEs): 


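$$\begin{aligned}
\dot{L}_1(t) &= F P_1(t) + F P_2(t - \tau) - s L_1(t), && \text{(23.3a)} \\
\dot{L}_2(t) &= F P_2(t) + F P_1(t - \tau) - s L_2(t), && \text{(23.3b)} \\
\dot{R}_1(t) &= F (1 - P_1(t)) + F (1 - P_2(t - r\tau)) - s R_1(t), && \text{(23.3c)} \\
\dot{R}_2(t) &= F (1 - P_2(t)) + F (1 - P_1(t - r\tau)) - s R_2(t), && \text{(23.3d)}
\end{aligned}$$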


where 
$P_1$ [$P_2$] is the probability of choosing the left branch at fork 1 [fork 2] as defined in (23.1)
$F$ is the number of ants per second leaving the nest


The first equation can be explained as follows: the change in the pheromone concentration $L_1(t)$ is due
to (1) the ants that choose the left branch at fork 1 and deposit pheromone as they set out; (2)
the ants that chose the left branch at fork 2 at time $t - \tau$, which reach fork 1 and deposit pheromone
on the left branch after $\tau$ s; and (3) the reduction of pheromone due to evaporation. The other equations
follow similarly.

Equation 23.3 is similar to the model used in [36,46], but we use P rather than $P_{n,k}$. It turns out that 
the fact that P = P(L — R) allows us to transform the averaged model into a two-dimensional model, 
which is easier to analyze. Furthermore, using P leads to a novel and interesting link between the aver- 
aged behavior of the ant colony and mathematical models used in the theory of ANNs (see Section 23.8). 


23.7 Simulations 


We simulated (23.3) and compared the results to a Monte Carlo simulation of the stochastic model with 
a colony of 1000 foragers. Note that (23.3) describes pheromone concentrations, not ant numbers. Yet 
there is, of course, a correspondence between the two since the pheromones are laid by ants. We used the 
parameters F = 1, $\tau = 20$, r = 2, and the initial conditions $L_1(t) = 1$, $L_2(t) = R_1(t) = R_2(t) = 0$, for all $t \in [-r\tau, 0]$.
To analyze the effect of evaporation, we considered two different values of s. 

Figure 23.7 depicts the pheromone concentrations as a function of time for s = 0.012. The stochas- 
tic model and the averaged model behave similarly. Initially, the concentrations on both branches are 
equal. As time progresses, the left branch, which is the shorter one, receives more and more markings. 
The difference $L_1(t) - R_1(t)$ converges to a steady state value of 164.98.
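
The averaged-model run can be approximated with a simple fixed-step scheme. The sketch below integrates (23.3) by the forward Euler method with a stored history for the delayed terms; the integration method, the step size `dt`, the final time `t_end`, and the function names are assumptions made for this illustration, since the chapter does not state which solver was used.

```python
import math


def p_left(d, q=0.016):
    """Fuzzy choice function (23.1) evaluated at the difference d = L - R."""
    return 0.5 * (1.0 + math.tanh(q * d))


def simulate_averaged(F=1.0, tau=20.0, r=2.0, s=0.012, q=0.016,
                      t_end=3000.0, dt=0.05):
    """Forward Euler integration of the delay differential equations (23.3)."""
    lag1 = round(tau / dt)      # steps corresponding to the delay tau
    lag2 = round(r * tau / dt)  # steps corresponding to the delay r * tau
    n_steps = round(t_end / dt)
    # Constant history on [-r * tau, 0]: L1 = 1 and all other trails unmarked.
    L1 = [1.0] * (lag2 + 1)
    L2 = [0.0] * (lag2 + 1)
    R1 = [0.0] * (lag2 + 1)
    R2 = [0.0] * (lag2 + 1)
    for _ in range(n_steps):
        k = len(L1) - 1  # index of the current time
        p1 = p_left(L1[k] - R1[k], q)
        p2 = p_left(L2[k] - R2[k], q)
        p1_tau = p_left(L1[k - lag1] - R1[k - lag1], q)
        p2_tau = p_left(L2[k - lag1] - R2[k - lag1], q)
        p1_rtau = p_left(L1[k - lag2] - R1[k - lag2], q)
        p2_rtau = p_left(L2[k - lag2] - R2[k - lag2], q)
        L1.append(L1[k] + dt * (F * p1 + F * p2_tau - s * L1[k]))
        L2.append(L2[k] + dt * (F * p2 + F * p1_tau - s * L2[k]))
        R1.append(R1[k] + dt * (F * (1.0 - p1) + F * (1.0 - p2_rtau) - s * R1[k]))
        R2.append(R2[k] + dt * (F * (1.0 - p2) + F * (1.0 - p1_rtau) - s * R2[k]))
    return L1[-1], R1[-1], L2[-1], R2[-1]


if __name__ == "__main__":
    l1, r1, l2, r2 = simulate_averaged()
    print("L1 - R1 at the final time:", l1 - r1)
```

With the parameters quoted above, the final difference can be compared against the steady-state value reported in the text; an adaptive DDE solver would be the natural replacement for the Euler loop in more careful experiments.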

Figure 23.8 depicts the results of the simulations for a higher value of evaporation rate, namely, s = 0.2. 
In this case, the traffic tends to distribute equally along the two branches. The reason is that the posi- 
tive feedback process of pheromone laying is ineffective because the pheromones evaporate faster than 
the ants can lay them. This makes it impossible to detect the shorter branch. This is in agreement with 
the behavior actually observed in nature: “Only if the amount of ants arriving at the fork is insufficient
to maintain the pheromone trail in the face of evaporation will no choice be arrived at and the traffic
distributed equally over the two branches” [46, p. 408].

FIGURE 23.7 $L_1(t)$ (solid line) and $R_1(t)$ (dashed line) for r = 2, s = 0.012: (a) solution of DDE (23.3) with q = 0.016
and (b) Monte Carlo simulation.

FIGURE 23.8 $L_1(t)$ (solid line) and $R_1(t)$ (dashed line) for r = 2, s = 0.2: (a) solution of DDE (23.3) with q = 0.016
and (b) Monte Carlo simulation.


23.8 Analysis of the Averaged Model 


In this section, we provide a rigorous analysis of the averaged model. Let $v_j(t) := (L_j(t) - R_j(t))/F$, j = 1, 2,
that is, the scaled difference between the pheromone concentrations on the left-hand and right-hand
sides of fork j. Using (23.3) and (23.1) yields


$$\begin{aligned}
\dot{v}_1(t) &= -s v_1(t) + \tanh(p v_1(t)) + \frac{\tanh(p v_2(t - \tau)) + \tanh(p v_2(t - r\tau))}{2}, \\
\dot{v}_2(t) &= -s v_2(t) + \tanh(p v_2(t)) + \frac{\tanh(p v_1(t - \tau)) + \tanh(p v_1(t - r\tau))}{2},
\end{aligned} \tag{23.4}$$


where $p := qF > 0$. Note that this simplification from a fourth-order to a second-order DDE is possible
because our probability function, unlike (23.2), depends only on the difference L — R. 
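
For readers who want the intermediate step, the reduction for fork 1 can be sketched as follows. It uses only the first and third equations of (23.3), the form of P in (23.1), the definition of $v_j$, and $p = qF$; the case of fork 2 is identical with the indices exchanged.

```latex
\begin{aligned}
\dot{v}_1(t) &= \frac{\dot{L}_1(t) - \dot{R}_1(t)}{F} \\
 &= -s\,v_1(t) + \bigl(2P_1(t) - 1\bigr) + \bigl(P_2(t-\tau) + P_2(t-r\tau) - 1\bigr), \\
2P_j(t) - 1 &= \tanh\bigl(q\,(L_j(t) - R_j(t))\bigr) = \tanh\bigl(p\,v_j(t)\bigr), \\
P_2(t-\tau) + P_2(t-r\tau) - 1 &= \frac{\tanh\bigl(p\,v_2(t-\tau)\bigr) + \tanh\bigl(p\,v_2(t-r\tau)\bigr)}{2}.
\end{aligned}
```

The same cancellation is not available for a choice function such as (23.2), which depends on L and R separately rather than on their difference.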

Models in the form (23.4) were used in the context of Hopfield-type ANNs with time delays (see 
[49] and the references therein). In this context, (23.4) represents a system of two dynamic neurons, 
each possessing nonlinear feedback, and coupled via nonlinear connections. The time delays represent 
propagation times along these connections. This yields an interesting and novel connection between 
the aggregated behavior of the colony and classical models used in the theory of ANNs. The set of 
ants choosing the left (right) path corresponds to the state of the first (second) neuron. The effect of 
the chemical communication between the ants corresponds to the time-delayed feedback connections 
between the neurons. 


23.8.1 Equilibrium Solutions 


The equilibrium solutions of (23.4) are $v(t) \equiv (v, v)^T$, where v satisfies


$$s v - 2\tanh(p v) = 0. \tag{23.5}$$


The properties of the hyperbolic tangent function yield the following result. 


Proposition 23.1 If s > 2p, then the unique solution of (23.5) is v = 0, so (23.4) admits a unique equilibrium
solution $v(t) \equiv 0$. If $s \in (0, 2p)$, then (23.5) admits three solutions $0, \bar{v}, -\bar{v}$, with $\bar{v} > 0$, and (23.4)
admits three equilibrium solutions: $v(t) \equiv 0$, $v(t) \equiv v^1 := (\bar{v}, \bar{v})^T$, and $v(t) \equiv -v^1$.
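
The positive root of (23.5) has no closed form, but it is easy to compute numerically. The sketch below uses plain bisection; the function name `positive_equilibrium`, the bracketing strategy, and the tolerance are choices made here for illustration.

```python
import math


def positive_equilibrium(s: float, p: float, tol: float = 1e-10):
    """Return the positive root of s*v - 2*tanh(p*v) = 0, or None if s >= 2p."""
    if s >= 2.0 * p:
        return None  # only the trivial equilibrium v = 0 exists

    def g(v):
        return s * v - 2.0 * math.tanh(p * v)

    hi = 2.0 / s  # g(hi) = 2 - 2*tanh(2p/s) > 0, so the root lies below hi
    lo = hi
    while g(lo) >= 0.0:  # g < 0 just to the right of zero when s < 2p
        lo *= 0.5
    while hi - lo > tol:  # bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Parameters used in the simulations: s = 0.012 and p = qF = 0.016.
    print(positive_equilibrium(0.012, 0.016))
    # High evaporation (s = 0.2 > 2p): no nonzero equilibrium.
    print(positive_equilibrium(0.2, 0.016))
```

For s = 0.012 and p = 0.016 the computed root can be compared with the steady-state difference observed in the simulations of Section 23.7, since F = 1 there.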


23.8.2 Stability 


For the sake of completeness, we recall some stability definitions for DDEs (for more details, see [50-52]). 
Consider the DDE 


$$\dot{x}(t) = f(x(t), x(t - d)), \quad t \ge t_0, \tag{23.6}$$


with the initial condition $x(t) = \phi(t)$, $t \in [t_0 - d, t_0]$, and suppose that $x(t) \equiv 0$ is an equilibrium solution.
For a continuous function $\phi: [t_0 - d, t_0] \to \mathbb{R}^n$, define the continuous norm by
$\|\phi\|_c := \max\{\|\phi(\theta)\| : \theta \in [t_0 - d, t_0]\}$.


Definition 23.1 The solution 0 is said to be uniformly stable if for any $t_0 \in \mathbb{R}$ and any $\epsilon > 0$ there exists
$\delta = \delta(\epsilon) > 0$ such that $\|\phi\|_c < \delta$ implies that $\|x(t)\| < \epsilon$ for $t \ge t_0$. It is uniformly asymptotically stable if
it is uniformly stable, and there exists $\delta_a > 0$ such that for any $\alpha > 0$, there exists $T = T(\delta_a, \alpha)$, such that
$\|\phi\|_c < \delta_a$ implies that $\|x(t)\| < \alpha$ for $t \ge t_0 + T$. It is globally uniformly asymptotically stable (GUAS) if it is
uniformly asymptotically stable and $\delta_a$ can be an arbitrarily large number.

Proposition 23.1 suggests that we need to consider the two cases s > 2p and s € (0, 2p) separately. 


23.8.2.1 High Evaporation 


The next result shows that if s > 2p, then $L_i(t) \equiv R_i(t)$, i = 1, 2, is a GUAS solution of (23.3). In other words,
when the evaporation rate is high and the pheromones cannot accumulate, the positive feedback process 
leading to a favorable trail cannot take place and eventually the traffic will be divided equally along the 
two possible branches. 


Theorem 23.1 If s > 2p > 0, then 0 is a GUAS solution of (23.4) for any $\tau > 0$ and $r \ge 1$.
Proof: See [53]. 


23.8.2.2 Low Evaporation 


For $s \in (0, 2p)$, the system admits three equilibrium solutions: 0, $v^1$, and $-v^1$.


Proposition 23.2 If $s \in (0, 2p)$, then 0 is an unstable solution of (23.4), and both $v^1$ and $-v^1$ are uniformly
asymptotically stable solutions.
Proof: See [53]. 


Thus, $L_1(t) - R_1(t) \equiv L_2(t) - R_2(t) \equiv F\bar{v}$ and $L_1(t) - R_1(t) \equiv L_2(t) - R_2(t) \equiv -F\bar{v}$, where $\bar{v} > 0$ is the
positive solution of (23.5), are stable solutions of the averaged model. In other words, for low evaporation
the system has a tendency toward a non-symmetric state, where one trail is more favorable than the other.

Summarizing, the analysis of the model implies that the system undergoes a bifurcation when s = 2p. 
Recall that s is the evaporation rate, and p is the product of the parameter q that determines the “sensi- 
tivity” of P(D), and the rate of ants leaving the nest F. If s < 2p (i.e., low evaporation), the pheromone lay- 
ing process functions properly and leads to steady state solutions where one trail is more favorable than 
the other. If s > 2p (high evaporation), then all solutions converge to the steady state where the traffic is 
divided equally along the two possible branches. These results hold for any delay $\tau > 0$, as both Theorem 
23.1 and Proposition 23.2 are delay-independent. 

Thus, the model predicts that if the evaporation rate s increases, the ants must respond by increasing 
either (1) the rate of ants leaving the nest or (2) their sensitivity to small differences in the pheromone 
concentrations. It might be interesting to try and verify this prediction in the real biological system. 


23.9 Conclusions 


In many fields of science, researchers provide verbal descriptions of various phenomena. FM is a simple 
and direct approach for transforming these verbal descriptions into well-defined mathematical models. 

The development of such models can also be used to address various engineering problems. This is 
because many artificial systems must function in the real world and address problems similar to those 
encountered by biological agents such as plants or animals. The field of biomimicry is concerned with 
developing artificial systems inspired by the behavior of biological agents. An important component in 
this field is the ability to perform reverse engineering of an animal’s functioning, and then implement 
this behavior in an artificial system. We believe that the FM approach may be suitable for addressing 
biomimicry in a systematic manner. Namely, start with a verbal description of an animal’s behavior 
(e.g., foraging in ants) and, using fuzzy logic theory, obtain a mathematical model of this behavior which 
can be implemented by artificial systems (e.g., autonomous robots). 

In this chapter, we described a first step in this direction by applying FM to transform a verbal 
description of the foraging behavior of ants into a well-defined mathematical model. Simulations and 
rigorous analysis of the resulting model demonstrate good fidelity with the behavior actually observed 
in nature. Furthermore, when the fuzzy model is substituted in a mathematical model for the colony of 
foragers, it leads to an interesting connection with models used in the theory of ANNs. Unlike previous 
models, the fuzzy model is also simple enough to allow a rather detailed mathematical analysis. 

The collective behavior of social insects has inspired many interesting engineering designs (see, e.g., 
[39,54]). Further research is necessary in order to study the application of the model studied here to 
various engineering problems. 


References 


1. H. Goldstein, Classical Mechanics. Addison-Wesley, Reading, MA, 1980. 

2. J. Giarratano and G. Riley, Expert Systems: Principles and Programming, 3rd edn. PWS Publishing 
Company, Boston, MA, 1998. 

3. L. A. Zadeh, Fuzzy sets, Information and Control, 8, 338-353, 1965. 

4. L. A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, 
IEEE Transactions Systems, Man, Cybernetics, 3, 28-44, 1973. 




Fuzzy Modeling of Animal Behavior and Biomimicry: The Fuzzy Ant 23-15 


5. W. Siler and J. J. Buckley, Fuzzy Expert Systems and Fuzzy Reasoning. Wiley-Interscience, Hoboken, NJ, 2005.

6. A. Kandel, Ed., Fuzzy Expert Systems. CRC Press, Boca Raton, FL, 1992.

7. K. Hirota and M. Sugeno, Eds., Industrial Applications of Fuzzy Technology in the World. World Scientific, River Edge, NJ, 1995.

8. D. Dubois, H. Prade, and R. R. Yager, Eds., Fuzzy Information Engineering. Wiley, New York, 1997.

9. K. Tanaka and M. Sugeno, Introduction to fuzzy modeling, in Fuzzy Systems: Modeling and Control, H. T. Nguyen and M. Sugeno, Eds. Kluwer, Norwell, MA, 1998, pp. 63-89.

10. T. Terano, K. Asai, and M. Sugeno, Applied Fuzzy Systems. AP Professional, Cambridge, MA, 1994.

11. R. R. Yager and D. P. Filev, Essentials of Fuzzy Modeling and Control. John Wiley & Sons, Chichester, U.K., 1994.

12. J. Yen, R. Langari, and L. Zadeh, Eds., Industrial Applications of Fuzzy Logic and Intelligent Systems. IEEE Press, Piscataway, NJ, 1995.

13. W. Pedrycz, Ed., Fuzzy Sets Engineering. CRC Press, Boca Raton, FL, 1995.

14. D. Dubois, H. T. Nguyen, H. Prade, and M. Sugeno, Introduction: The real contribution of fuzzy systems, in Fuzzy Systems: Modeling and Control, H. T. Nguyen and M. Sugeno, Eds. Kluwer, Norwell, MA, 1998, pp. 1-17.

15. M. Margaliot and G. Langholz, Fuzzy Lyapunov based approach to the design of fuzzy controllers, Fuzzy Sets Systems, 106, 49-59, 1999.

16. M. Margaliot and G. Langholz, New Approaches to Fuzzy Modeling and Control: Design and Analysis. World Scientific, Singapore, 2000.

17. L. A. Zadeh, Fuzzy logic = computing with words, IEEE Transactions Fuzzy Systems, 4, 103-111, 1996.

18. E. Kolman and M. Margaliot, Knowledge-Based Neurocomputing: A Fuzzy Logic Approach. Springer, Berlin, Germany, 2009.

19. E. Tron and M. Margaliot, Mathematical modeling of observed natural behavior: A fuzzy logic approach, Fuzzy Sets Systems, 146, 437-450, 2004.

20. K. Z. Lorenz, King Solomon's Ring: New Light on Animal Ways. Methuen & Co., London, U.K., 1957.

21. E. Tron and M. Margaliot, How does the Dendrocoelum lacteum orient to light? A fuzzy modeling approach, Fuzzy Sets Systems, 155, 236-251, 2005.

22. I. L. Bajec, N. Zimic, and M. Mraz, Simulating flocks on the wing: The fuzzy approach, Journal of Theoretical Biology, 233, 199-220, 2005.

23. I. Rashkovsky and M. Margaliot, Nicholson's blowflies revisited: A fuzzy modeling approach, Fuzzy Sets Systems, 158, 1083-1096, 2007.

24. D. Laschov and M. Margaliot, Mathematical modeling of the lambda switch: A fuzzy logic approach, Journal of Theoretical Biology, 260, 475-489, 2009.

25. N. Tinbergen, The Study of Instinct. Oxford University Press, London, U.K., 1969.

26. G. S. Fraenkel and D. L. Gunn, The Orientation of Animals: Kineses, Taxes, and Compass Reactions. Dover Publications, New York, 1961.

27. S. Schockaert, M. De Cock, C. Cornelis, and E. E. Kerre, Fuzzy ant based clustering, in Ant Colony Optimization and Swarm Intelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, and T. Stutzle, Eds. Springer-Verlag, Berlin, Germany, 2004, pp. 342-349.

28. Y. Bar-Cohen and C. Breazeal, Eds., Biologically Inspired Intelligent Robots. SPIE Press, Bellingham, WA, 2003.

29. C. Mattheck, Design in Nature: Learning from Trees. Springer-Verlag, Berlin, Germany, 1998.

30. C. Chang and P. Gaudiano, Eds., Robotics and Autonomous Systems, Special Issue on Biomimetic Robotics, 30, 39-64, 2000.

31. K. M. Passino, Biomimicry for Optimization, Control, and Automation. Springer, London, U.K., 2004.

32. M. Margaliot, Biomimicry and fuzzy modeling: A match made in heaven, IEEE Computational Intelligence Magazine, 3, 38-48, 2008.




33. J. M. C. Sousa and U. Kaymak, Fuzzy Decision Making in Modeling and Control. World Scientific, Singapore, 2002.

34. V. Novak, Are fuzzy sets a reasonable tool for modeling vague phenomena? Fuzzy Sets Systems, 156, 341-348, 2005.

35. E. O. Wilson, Sociobiology: The New Synthesis. Harvard University Press, Cambridge, MA, 1975.

36. S. Goss, S. Aron, J. L. Deneubourg, and J. M. Pasteels, Self-organized shortcuts in the Argentine ant, Naturwissenschaften, 76, 579-581, 1989.

37. M. Dorigo, M. Birattari, and T. Stutzle, Ant colony optimization: Artificial ants as a computational intelligence technique, IEEE Computational Intelligence Magazine, 1, 28-39, 2006.

38. H. V. D. Parunak, Go to the ant: Engineering principles from natural multi agent systems, Annals Operations Research, 75, 69-101, 1997.

39. E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, Oxford, U.K., 1999.

40. R. Schoonderwoerd, O. E. Holland, J. L. Bruten, and L. J. M. Rothkrantz, Ant-based load balancing in telecommunications networks, Adaptive Behavior, 5, 169-207, 1997.

41. G. W. Bluman and S. C. Anco, Symmetry and Integration Methods for Differential Equations. Springer-Verlag, New York, 2002.

42. L. A. Segel, Simplification and scaling, SIAM Review, 14, 547-571, 1972.

43. S. Guillaume, Designing fuzzy inference systems from data: An interpretability-oriented review, IEEE Transactions Fuzzy Systems, 9, 426-443, 2001.

44. G. Bontempi, H. Bersini, and M. Birattari, The local paradigm for modeling and control: From neuro-fuzzy to lazy learning, Fuzzy Sets Systems, 121, 59-72, 2001.

45. J. S. R. Jang, C. T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice-Hall, Englewood Cliffs, NJ, 1997.

46. R. Beckers, J. L. Deneubourg, and S. Goss, Trails and U-turns in the selection of a path by the ant Lasius niger, Journal of Theoretical Biology, 159, 397-415, 1992.

47. H. T. Nguyen, V. Kreinovich, M. Margaliot, and G. Langholz, Hyperbolic approach to fuzzy control is optimal, in Proceedings of the 10th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'2001), Melbourne, Australia, 2001, pp. 888-891.

48. J. L. Deneubourg, S. Aron, S. Goss, and J. M. Pasteels, The self-organizing exploratory pattern of the Argentine ant, Journal of Insect Behavior, 3, 159-168, 1990.

49. L. P. Shayer and S. A. Campbell, Stability, bifurcation, and multistability in a system of two coupled neurons with multiple time delays, SIAM Journal on Applied Mathematics, 61, 673-700, 2000.

K. Gu, V. L. Kharitonov, and J. Chen, Stability of Time-Delay Systems. Birkhauser, Boston, MA, 2003. 
S.-I. Niculescu, E. I. Verriest, L. Dugard, and J.-M. Dion, Stability and robust stability of time-delay 
systems: A guided tour, in Stability and Control of Time-Delay Systems, L. Dugard and E. I. Verriest, 
Eds. Springer, Berlin, Germany, 1998, pp. 1-71. 

V. Lakshmikantham and S. Leela, Differential and Integral Inequalities: Theory and Applications. 
Academic Press, New York, 1969. 

V. Rozin and M. Margaliot, The fuzzy ant, IEEE Computational Intelligence Magazine, 26(2), 18-28, 2007. 
M. Dorigo and T. Stutzle, Ant Colony Optimization. MIT Press, Cambridge, MA, 2004. 




Optimizations 


24 Multiobjective Optimization Methods   Tak Ming Chan, Kit Sang Tang, Sam Kwong, and Kim Fung Man 
   Introduction • Multiobjective Evolutionary Algorithms • Concluding Remarks • References 

25 Fundamentals of Evolutionary Multiobjective Optimization   Carlos A. Coello Coello 
   Basic Concepts • Use of Evolutionary Algorithms • Multi-Objective Evolutionary Algorithms • 
   Applications • Current Challenges • Conclusions • Acknowledgment • References 

26 Ant Colony Optimization   Christian Blum and Manuel López-Ibáñez 
   Introduction • Combinatorial Optimization Problems • Optimization Algorithms • Ant Colony 
   Optimization • Modern ACO Algorithms • Extensions of the ACO Metaheuristic • Applications 
   of ACO Algorithms • Concluding Remarks • References 

27 Heuristics for Two-Dimensional Bin-Packing Problems   Tak Ming Chan, Filipe Alvelos, 
   Elsa Silva, and J.M. Valério de Carvalho 
   Introduction • Bin-Packing Problems • Heuristics • Computational Results • Conclusions • 
   References 

28 Particle Swarm Optimization   Adam Slowik 
   Introduction • Particle Swarm Optimization Algorithm • Modifications of PSO Algorithm • 
   Example • Summary • References 


24 


Multiobjective 
Optimization Methods 


Tak Ming Chan 
University of Minho 

Kit Sang Tang 
City University of Hong Kong 

Sam Kwong 
City University of Hong Kong 

Kim Fung Man 
City University of Hong Kong 


24.1 Introduction 
     Classical Methodologies 
24.2 Multiobjective Evolutionary Algorithms 
     Multiobjective Genetic Algorithm • Niched Pareto Genetic Algorithm 2 • Non-Dominated 
     Sorting Genetic Algorithm 2 • Strength Pareto Evolutionary Algorithm 2 • Pareto Archived 
     Evolution Strategy • Micro Genetic Algorithm • Jumping Genes • Particle Swarm Optimization 
24.3 Concluding Remarks 
References 

24.1 Introduction 


Evolutionary computation in multiobjective optimization (MO) has now become a popular technique 
for solving problems that are considered to be conflicting, constrained, and sometimes mathematically 
intangible [KD01,CVL02]. This chapter aims to bring out the main features of MO starting from the 
classical approaches to state-of-the-art methodologies, all of which can be used for practical designs. 


24.1.1 Classical Methodologies 


The popular optimization methods for solving MO problems are generally classified into three categories: 
(1) enumerative methods, (2) deterministic methods, and (3) stochastic methods [CVL02,MF00,RL98]. 
Enumerative methods are computationally expensive. Hence, these are now seldom used by researchers. 
On the other hand, a vast literature for either the deterministic or stochastic methods has become 
available. Some useful references for these methodologies are presented in Table 24.1. 


24.1.1.1 Deterministic Methods 


The deterministic methods do not suffer from the enormous search space problem that usually 
confronts enumerative methods. These methods incorporate domain knowledge (heuristics) so as 
to limit the search space, so that acceptable solutions can be found in reasonable time 
in practice [MF00,BB88,NN96]. However, they are often ineffective in coping with high-dimensional, 
multimodal, and non-deterministic polynomial time (NP)-complete MO problems. The deficiency is 
largely due to their heuristic search performance, which in turn limits the search space [DG89]. 


TABLE 24.1 Deterministic and Stochastic Optimization Methods 

Types of Optimization Methods 

Deterministic                               Stochastic 
Greedy [CLR01,BB96]                         Random search/walk [ZZ03,FS76] 
Hill-climbing [HS95]                        Monte Carlo [JG03,RC87] 
Branch & bound [HT96,FKS02]                 Simulated annealing [VLA87,AZ01] 
Depth-first [JP84]                          Tabu search [GL97,FG89,GTW93] 
Breadth-first [JP84]                        Evolutionary computation/algorithms [DLD00,DF00,TMK08,CPL04] 
Best-first [JP84]                           Mathematical programming [MM86,HL95] 
Calculus-based [HA82] 
Mathematical programming [MM86,HL95] 


The other disadvantage is the point-by-point approach [KD01], which yields only one optimized 
solution per simulation run. Multiple runs are therefore required if a set of optimized solutions 
is sought, and even then, at best, these runs would only yield suboptimal solutions. 


24.1.1.2 Stochastic Methods 


Stochastic methods are the alternative to enumerative and deterministic methods for solving 
irregular MO problems [CVL02]. Stochastic methods can acquire multiple solutions in 
a single simulation run. Also, specific schemes can be devised to prevent the solutions from falling into 
the suboptimal domain [KD01]. 

These methods need an encode/decode mapping mechanism for coordinating between the problem and 
algorithm domains. Furthermore, a specific function is required for assigning the fitness values to pos- 
sible solutions. This approach does not guarantee true Pareto-optimal solutions but can offer reasonably 
adequate solutions to a number of MO problems where the deterministic methods failed to deliver [DG89]. 


24.2 Multiobjective Evolutionary Algorithms 


Evolutionary algorithms belong to a class of stochastic optimizing schemes. These algorithms 
emulate the process of natural selection, for which the noted philosopher Herbert Spencer coined the 
phrase “Survival of the fittest.” These algorithms share a number of common properties and genetic 
operations. 

The evolution of each individual is subject to the rules of genetics-inspired operations, i.e., 
selection, crossover, and mutation. Each individual in the population represents a potential solution to a 
certain problem. Also, a fitness value is assigned to each individual so as to evaluate it within the 
measurement-checking mechanism. Selection tends to allow the high-fitness individuals to reproduce 
more often than the low-fitness individuals. In genetic exploration, crossover exchanges 
information among the individuals, while mutation introduces a small perturbation 
within the structure of an individual. 

Three common evolutionary algorithms are (1) evolutionary programming, (2) evolution strategies, 
and (3) genetic algorithms. The differences between these three evolutionary algorithms are listed in 
Table 24.2, but the same principle applies to each of them. Further details of evolutionary 
programming and evolution strategies can be found in [HS95,DF00]. For genetic algorithms, see 
[DG89,GC00,DC99,CL04]. 

The essential properties of multiobjective evolutionary algorithms (MOEAs) are outlined in [TMC06]. 
These include (1) decision variables, (2) constraints, (3) Pareto optimality, and (4) convergence and 
diversity. 

TABLE 24.2 Differences between Three Evolutionary Algorithms 

Evolutionary Algorithm      Representation                         Genetic Operators 
Evolutionary programming    Real values                            Mutation and (μ + λ) selection 
Evolution strategies        Real values and strategy parameters    Crossover, mutation, and (μ + λ) or (μ, λ) selection 
Genetic algorithms          Binary or real values                  Crossover, mutation, and selection 

Note: μ is the number of parents and λ is the number of offspring. 


Currently, a number of MOEAs can be applied in practice: 


1. Multiobjective genetic algorithm (MOGA) [FF98,FF93] 
2. Niched Pareto genetic algorithm 2 (NPGA2) [EMH01] 
3. Nondominated sorting genetic algorithm 2 (NSGA2) [DAM00,DAM02] 
4. Strength Pareto evolutionary algorithm 2 (SPEA2) [ZLT01] 
5. Pareto archived evolution strategy (PAES) [KC09,KC00] 
6. Micro genetic algorithm (MICROGA) [CTP01,CP01] 
7. Jumping genes (JG) [TMC06,TMK08] 
8. Particle swarm optimization (PSO) [CPL04,VE06] 


All of these MOEAs are genetic algorithm-based methods except PAES, which is an evolution strategy- 
based method. The Pareto-based fitness assignment is used to identify non-dominated Pareto-optimal 
solutions. The basic principles of these algorithms are briefly outlined in this chapter. 


24.2.1 Multiobjective Genetic Algorithm 


MOGA was proposed by Fonseca and Fleming [FF98,FF93]. It has three features: (1) a modified version 
of Goldberg’s ranking scheme, (2) modified fitness assignment, and (3) niche count. The flowchart of the 
MOGA is shown in Figure 24.1. 

A modified ranking scheme, which differs slightly from Goldberg’s [FF93], is used in the 
MOGA. The scheme is depicted graphically in Figure 24.2. In this scheme, the rank 
of an individual I is given by the number of individuals (q) that dominate I plus 1. That is, 


Rank(I) = 1 + q, 

if the individual I is dominated by q other individuals. 

Modified fitness assignment 
The procedure of modified rank-based fitness assignment is shown as follows: 


1. Sort population according to rank. 

2. Assign fitness to individuals by interpolating from the best (rank 1) individual to the worst (rank 
n ≤ M, where M is the population size) on the basis of some function, usually linear or exponential, 
but possibly of some other type. 

3. Average the fitness assigned to individuals with the same rank, so that all of them are sampled at 
the same rate while keeping the global population fitness constant. 
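
A minimal sketch of the three steps follows, assuming linear interpolation between two end values. The parameters `best` and `worst` and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def rank_based_fitness(ranks, best=2.0, worst=0.0):
    """Rank-based fitness: sort, interpolate linearly, then average within each rank."""
    m = len(ranks)
    # Step 1: sort the population (indices) according to rank.
    order = sorted(range(m), key=lambda i: ranks[i])
    # Step 2: interpolate linearly from the best to the worst position.
    raw = [0.0] * m
    for pos, i in enumerate(order):
        frac = pos / (m - 1) if m > 1 else 0.0
        raw[i] = best + (worst - best) * frac
    # Step 3: average the fitness of individuals that share a rank.
    groups = defaultdict(list)
    for i, r in enumerate(ranks):
        groups[r].append(raw[i])
    return [sum(groups[r]) / len(groups[r]) for r in ranks]


fitness = rank_based_fitness([1, 1, 1, 2, 5])
print(fitness, sum(fitness))
```

Because averaging only redistributes fitness inside each rank group, the total population fitness is the same before and after step 3.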


Fitness sharing 
To obtain a uniformly distributed and widespread set of Pareto-optimal solutions in multiobjective 
optimization, at least two major difficulties must be overcome: (1) the finite population size and 
stochastic selection errors and (2) convergence of the population to a small region of the 
Pareto-optimal front (a phenomenon called genetic drift, which occurs in both natural and 
artificial evolution). 

The technique of fitness sharing was suggested to avoid such a problem [FF98]. It utilizes individual 
competition for finite resources in a closed environment. Similar individuals reduce each other’s 
fitness by competing for the same resource. The similarity of two individuals is measured either in the 
genotypic space (the number of gene differences between two individuals) or in the phenotypic space (the 
distance in objective values of two individuals). 

FIGURE 24.1 Flowchart of MOGA. 

FIGURE 24.2 Fonseca and Fleming’s ranking scheme (for minimizing f_1 and f_2). 

To perform fitness sharing, a sharing function is required to determine the fitness reduction of a 
chromosome on the basis of the crowding degree caused by its neighbors. The sharing function 
commonly used is 

\[
sf(d_{ij}) =
\begin{cases}
1 - \left( \dfrac{d_{ij}}{\sigma_{share}} \right)^{\alpha}, & \text{if } d_{ij} < \sigma_{share} \\
0, & \text{otherwise}
\end{cases}
\tag{24.1}
\]

where 
α is a constant for adjusting the shape of the sharing function 
σ_share is the niche radius chosen by the user for minimal separation desired 
d_ij is the distance between two individuals used in encoding space (genotypic sharing) or decoding 
space (phenotypic sharing) 

Then, the shared fitness of a chromosome can be acquired by dividing its fitness f(j) by its niche count n_j: 

\[
f'(j) = \frac{f(j)}{n_j} \tag{24.2}
\]

\[
n_j = \sum_{i=1}^{population\_size} sf(d_{ij}) \tag{24.3}
\]

where 
f(j) is the fitness of the chromosome considered 
n_j is the niche count including the considered chromosome itself 
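
Equations 24.1 through 24.3 can be sketched as follows. The sketch assumes phenotypic sharing with the Euclidean distance between objective vectors; the function names and the example values are illustrative.

```python
import math


def sharing(d, sigma_share, alpha=1.0):
    """Sharing function of Equation 24.1."""
    return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0


def shared_fitness(fitness, points, sigma_share, alpha=1.0):
    """Divide each fitness by its niche count (Equations 24.2 and 24.3)."""
    result = []
    for j, pj in enumerate(points):
        # The niche count includes the individual itself, since sf(0) = 1.
        niche = sum(sharing(math.dist(pj, pi), sigma_share, alpha) for pi in points)
        result.append(fitness[j] / niche)
    return result


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(shared_fitness([1.0, 1.0, 1.0], points, sigma_share=0.5))
```

The two neighboring individuals share a niche and have their fitness reduced, whereas the isolated individual keeps its full fitness.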


24.2.2 Niched Pareto Genetic Algorithm 2 


NPGA2 is an improved version of NPGA [HNG94] suggested by Erickson et al. [EMH01]. The flowchart 
of the NPGA2 is shown in Figure 24.3. The characteristic of this algorithm is the use of fitness sharing 
when the tournament selection ends in a tie. A tie means that the two picked individuals are either both 
dominated or both nondominated. If this happens, the niche count n_j in Equation 24.3 is computed for each 
selected individual. The individual with the lowest niche count is the winner and is therefore included in 
the mating pool. 

Niche counts are calculated by using individuals in the partially filled next generation population 
rather than the current generation population. This method is called continuously updated fitness 
sharing [OGC94]. Without such updating, the combination of tournament selection and fitness sharing 
would lead to chaotic perturbations of the population composition. Note that the values of the objective 
functions should be scaled to equal ranges for estimating the niche count, i.e., 

\[
f_i' = \frac{f_i - f_{i,\min}}{f_{i,\max} - f_{i,\min}} \tag{24.4}
\]

where 
f_i' is the scaled value of objective f_i 
f_{i,min} is the minimum value of objective f_i 
f_{i,max} is the maximum value of objective f_i 
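
The scaling of Equation 24.4 and the tie-breaking rule can be sketched as below. The sketch assumes a triangular sharing function (α = 1), takes the minimum and maximum of each objective from the individuals supplied, and uses hypothetical function names.

```python
import math


def scale_objectives(objectives):
    """Scale every objective to [0, 1] as in Equation 24.4."""
    columns = list(zip(*objectives))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(ind, lows, highs))
        for ind in objectives
    ]


def niche_count(candidate, others, sigma_share):
    """Niche count of a candidate against a set of scaled objective vectors."""
    return sum(max(0.0, 1.0 - math.dist(candidate, o) / sigma_share) for o in others)


def break_tie(a, b, next_generation, sigma_share):
    """Pick the candidate with the lower niche count in the partially filled next generation."""
    count_a = niche_count(a, next_generation, sigma_share)
    count_b = niche_count(b, next_generation, sigma_share)
    return a if count_a <= count_b else b


scaled = scale_objectives([(10.0, 200.0), (20.0, 100.0), (30.0, 400.0)])
winner = break_tie(scaled[0], scaled[2], next_generation=[(0.0, 0.3)], sigma_share=0.5)
print(scaled, winner)
```

Here the candidate lying far from the individuals already placed in the next generation wins the tie, which is the behavior that continuously updated sharing is meant to encourage.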
24.2.3 Non-Dominated Sorting Genetic Algorithm 2 


NSGA2 is the enhanced version of NSGA [SD94] proposed by Deb et al. [DAM00,DAM02]. The flow- 
chart of the NSGA2 is shown in Figure 24.4. It has four peculiarities: (1) fast non-dominated sorting, 
(2) crowding distance assignment, (3) crowded-comparison operator, and (4) elitism strategy. 


24.2.3.1 Fast Non-Dominated Sorting Approach 


The fast non-dominated sorting approach in the NSGA2 is aimed at reducing the 
high computational complexity, O(MN^3), of the traditional non-dominated sorting algorithm adopted in 
the NSGA. 

In the traditional non-dominated sorting approach, to find solutions lying on the first non-dominated 
front for a population of size N, each solution must be compared with every other solution in the 
population. Therefore, O(MN) comparisons are needed for each solution, and the total complexity is O(MN^2) 
for finding all solutions of the first non-dominated front, where M is the total number of objectives. 

To obtain the second non-dominated front, the solutions of the first front are temporarily discounted 
and the above procedure is repeated. In the worst case, finding the second front also requires O(MN^2) 
computations. The same applies to the subsequent non-dominated fronts (i.e., third, fourth, and 
so on). Hence, in the worst-case scenario, when there are N fronts and each front has only one solution, 
O(MN^3) computations are necessary overall. 

On the other hand, the fast non-dominated sorting approach, whose pseudo code and flowchart are 
shown in Figures 24.5 and 24.6, respectively, first computes for every solution p the domination count n_p 
(the number of solutions that dominate the solution p) and the set S_p (the set of solutions that the 
solution p dominates), which requires O(MN^2) comparisons. Then, all solutions with n_p = 0 are the members 
of the first non-dominated front. 
FIGURE 24.4 Flowchart of NSGA2. 


for each p ∈ P 
    S_p = ∅ 
    n_p = 0 
    for each q ∈ P 
        if (p ≺ q)                    // If p dominates q 
            S_p = S_p ∪ {q}           // Add q to the set of solutions dominated by p 
        else if (q ≺ p) 
            n_p = n_p + 1             // Increase the domination counter of p 
    if n_p = 0                        // p belongs to the first front 
        p_rank = 1 
        F_1 = F_1 ∪ {p} 
i = 1                                 // Initialize the front counter 
while F_i ≠ ∅ 
    Q = ∅                             // Used to store the members of the next front 
    for each p ∈ F_i 
        for each q ∈ S_p 
            n_q = n_q − 1 
            if n_q = 0                // q belongs to the next front 
                q_rank = i + 1 
                Q = Q ∪ {q} 
    i = i + 1 
    F_i = Q 


FIGURE 24.5 Pseudo code of the fast non-dominated sort. 

FIGURE 24.6 Flowchart of the fast non-dominated sort. 


For each solution p belonging to the first front, n_q of each solution q belonging to S_p is decreased by one. If 
n_q = 0 for any solution q, the solution q is placed in a separate list Q. All solutions in Q are the members of the 
second non-dominated front. The above procedure is repeated until all non-dominated fronts are identified. 

To calculate the computational complexity with reference to Figure 24.5, the first inner loop (for each 
p ∈ F_i) is executed N times in total, since each individual can be a member of at most one non-dominated front. 
Also, the second inner loop (for each q ∈ S_p) can be executed at most (N − 1) times for each 
individual (i.e., each individual dominates at most (N − 1) individuals, and each domination check requires 
at most M comparisons). Therefore, this results in O(MN^2) computations overall. 
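
As an illustration, the procedure of Figure 24.5 can be written as the following sketch. It assumes minimization of all objectives, represents individuals by their objective tuples, and returns the fronts as lists of indices; the function names are illustrative rather than taken from [DAM02].

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (minimization assumed)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def fast_non_dominated_sort(objectives):
    """Return the non-dominated fronts as lists of indices, best front first."""
    n = len(objectives)
    dominated = [[] for _ in range(n)]  # S_p: solutions dominated by p
    count = [0] * n                     # n_p: number of solutions dominating p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objectives[p], objectives[q]):
                dominated[p].append(q)
            elif dominates(objectives[q], objectives[p]):
                count[p] += 1
        if count[p] == 0:               # p belongs to the first front
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in dominated[p]:
                count[q] -= 1
                if count[q] == 0:       # q belongs to the next front
                    next_front.append(q)
        i += 1
        fronts.append(next_front)
    return fronts[:-1]                  # drop the trailing empty front


population = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
print(fast_non_dominated_sort(population))
```

The double loop over p and q accounts for the O(MN^2) comparisons, and every index is appended to exactly one front.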


24.2.3.2 Crowded-Comparison Approach 


The goal of using the crowded-comparison approach in the NSGA2 is to eliminate the difficulties arising 
from the well-known sharing function in the NSGA. The merit of this new approach is the preservation 
of population diversity without using any user-defined parameter. This approach consists of the 
crowding-distance assignment and the crowded-comparison operator. 


Crowding-Distance Assignment 

In order to obtain diversified Pareto-optimal solutions, the crowding-distance assignment can be 
devised. Its purpose is to estimate the density of solutions surrounding a particular solution within 
the population. The preferable solutions in the less crowded area may be chosen on the basis of their 
assigned crowding distances. 

The crowding distance measures the perimeter of the cuboid, which is formed by using the nearest 
neighbors as the vertices. As indicated in Figure 24.7, the crowding distance of the ith non-dominated solution 
marked with the solid circle is the average side length of the cuboid represented by a dashed box, assuming 
that the non-dominated solutions i, i − 1, and i + 1 in Figure 24.7 have the function values (f_1, f_2), (f_1', f_2'), and 
(f_1'', f_2''), respectively. The crowding distance I[i]_distance of the solution i in the non-dominated set I is given by 

\[
I[i]_{distance} = \left| f_1' - f_1'' \right| + \left| f_2' - f_2'' \right| \tag{24.5}
\]
FIGURE 24.7 Crowding-distance computation. 


l = |I|                                        // Number of solutions in I 
for each i 
    set I[i]_distance = 0                      // Initialize distance 
for each objective m 
    I = sort(I, m)                             // Sort using each objective value 
    I[1]_distance = I[l]_distance = ∞          // Boundary points must be selected 
    for i = 2 to (l − 1) 
        I[i]_distance = I[i]_distance + (I[i+1].m − I[i−1].m) / (f_m^max − f_m^min) 


FIGURE 24.8 Pseudo code of the crowding-distance assignment. 


The pseudo code and flowchart of the crowding-distance assignment are shown in Figures 24.8 and 24.9, 
respectively. The procedures are given as follows: 


Step 1: Set the crowding distance I[i]_distance of each non-dominated solution i as zero. 

Step 2: Set the counter of objective function m = 1. 

Step 3: Sort the population according to each objective value of the mth objective function in 
the ascending order. 

Step 4: For the mth objective function, assign a nominated relatively large I[i]_distance for the extreme 
solutions of the non-dominated front, i.e., the smallest and largest function values. This 
selected I[i]_distance must be greater than the crowding distance values of other solutions within 
the same non-dominated front. 

Step 5: For the mth objective function, calculate I[i]_distance for each of the remaining non-dominated 
solutions i = 2, 3, ..., l − 1 by using the following equation 

\[
I[i]_{distance} = I[i]_{distance} + \frac{I[i+1].m - I[i-1].m}{f_m^{\max} - f_m^{\min}} \tag{24.6}
\]

where 
l is the total number of solutions in a non-dominated set I 
I[i].m is the mth objective function value of the ith solution in I 
f_m^max and f_m^min are the maximum and minimum values of the mth objective function 

Step 6: Set m = m + 1. If m ≤ M, where M is the total number of objective functions, go to Step 3. 

FIGURE 24.9 Flowchart of the crowding-distance assignment. 


A solution with a smaller crowding distance value is more crowded. In other words, this 
solution and its surrounding solutions are close to one another. 
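
The steps above can be sketched as follows. The sketch assumes that the maximum and minimum of each objective are taken over the front being processed and that infinity is used as the large value for the extreme solutions; the function name is illustrative.

```python
import math


def crowding_distance(front):
    """Crowding distance of each objective vector in one non-dominated front."""
    size = len(front)
    distance = [0.0] * size                       # Step 1: initialize to zero
    if size == 0:
        return distance
    for m in range(len(front[0])):                # Steps 2 and 6: loop over objectives
        order = sorted(range(size), key=lambda i: front[i][m])   # Step 3
        f_min, f_max = front[order[0]][m], front[order[-1]][m]
        # Step 4: extreme solutions receive an infinitely large distance.
        distance[order[0]] = distance[order[-1]] = math.inf
        if f_max == f_min:
            continue                              # avoid division by zero
        for k in range(1, size - 1):              # Step 5: Equation 24.6
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            distance[order[k]] += gap / (f_max - f_min)
    return distance


print(crowding_distance([(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]))
```

For the three-point front used here, the middle solution accumulates one normalized gap per objective, while the two extreme solutions keep an infinite distance.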


24.2.3.3 Crowded-Comparison Operator 


The crowded-comparison operator, which is implemented in the selection process, is used to obtain 
uniformly distributed Pareto-optimal solutions. If two picked solutions have different ranks, the 
solution with the lower rank is chosen. However, if their ranks are the same, the solution with 
the larger crowding distance (i.e., located in a less crowded region) is preferred. 
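
A minimal sketch of the operator, used inside a binary tournament, is given below. It assumes that ranks and crowding distances have already been computed; the function names and the `rng` parameter are illustrative.

```python
import random


def crowded_better(rank_a, dist_a, rank_b, dist_b):
    """Return True if solution a is preferred to solution b."""
    if rank_a != rank_b:
        return rank_a < rank_b      # the lower rank wins
    return dist_a > dist_b          # same rank: the less crowded solution wins


def binary_tournament(ranks, distances, rng=random):
    """Pick two distinct individuals at random and return the index of the preferred one."""
    a, b = rng.sample(range(len(ranks)), 2)
    if crowded_better(ranks[a], distances[a], ranks[b], distances[b]):
        return a
    return b


ranks = [1, 1, 2, 3]
distances = [0.4, 1.7, 0.9, float("inf")]
print(binary_tournament(ranks, distances))
```

When both rank and crowding distance are equal, this sketch simply returns the second individual drawn; other tie-breaking choices are equally possible.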


24.2.3.4 Elitism Strategy 


The procedure of executing the elitism is outlined as follows: 


Step 1: Assuming that the tth generation is considered, combine a parent population P_t and an 
offspring population Q_t to form a population R_t with size 2N. 

Step 2: Sort R_t using the fast non-dominated sorting technique and then identify the different 
non-dominated fronts F_1, F_2, .... 

Step 3: Include these fronts into the new parent population P_{t+1} one by one until P_{t+1}, with size N, is full. 


However, not all the members of a particular front can always be added to P_{t+1}. For example, suppose the 
fronts F_1 and F_2 are completely added to P_{t+1}, after which P_{t+1} has only j vacancies. If F_3 has k members, 
where k > j, they are sorted by using the crowded-comparison operator in descending order of 
the crowding distance. Then, the first j members with the larger crowding distances are selected to 
fill P_{t+1}. The pseudo code and flowchart of the elitism strategy are shown in Figures 24.10 and 24.11, 
respectively. 




R_t = P_t ∪ Q_t                                    // Combine parent and offspring population 
F = fast-non-dominated-sort(R_t)                   // F = (F_1, F_2, ...), all non-dominated fronts of R_t 
P_{t+1} = ∅ and i = 1 
while |P_{t+1}| + |F_i| ≤ N                        // Until the parent population is filled 
    crowding-distance-assignment(F_i)              // Calculate crowding distance in F_i 
    P_{t+1} = P_{t+1} ∪ F_i                        // Include ith non-dominated front in the parent population 
    i = i + 1                                      // Check the next front for inclusion 
sort(F_i, ≺_n)                                     // Sort in descending order by using ≺_n 
P_{t+1} = P_{t+1} ∪ F_i[1 : (N − |P_{t+1}|)]       // Choose the first (N − |P_{t+1}|) elements of F_i 


FIGURE 24.10 Pseudo code of the elitism strategy. 

FIGURE 24.11 Flowchart of the elitism strategy. 
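
The elitism procedure can be sketched as follows. To keep the sketch short, it assumes that the fronts are already available as lists of indices and that `crowding` is a caller-supplied function (hypothetical here) that maps a front to a dictionary of crowding distances.

```python
def select_next_parents(fronts, crowding, n):
    """Fill the next parent population of size n front by front.

    fronts:   list of lists of indices, best front first.
    crowding: function mapping a front to {index: crowding distance}.
    """
    parents = []
    for front in fronts:
        if len(parents) + len(front) <= n:
            parents.extend(front)                 # the whole front fits
        else:
            dist = crowding(front)
            # Sort the last front in descending order of crowding distance
            # and take only as many members as there are vacancies.
            ranked = sorted(front, key=lambda i: dist[i], reverse=True)
            parents.extend(ranked[: n - len(parents)])
            break
    return parents


fronts = [[0, 1, 2], [3, 4, 5], [6]]
distances = {3: 0.4, 4: float("inf"), 5: 1.2}     # illustrative values
print(select_next_parents(fronts, lambda front: distances, n=5))
```

In this example the first front fits entirely, and the two remaining vacancies are filled by the least crowded members of the second front.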


24.2.4 Strength Pareto Evolutionary Algorithm 2 


SPEA2 [ZLT01] is an improved version of SPEA [ZT99]. It consists of three main features: (1) strength 
value and raw fitness, (2) density estimation, and (3) archive truncation method. The flowchart of the 
SPEA2 is shown in Figure 24.12. 


24.2.4.1 Strength Value and Raw Fitness 


The strength value of each individual i in the population with size |P| and archive with size |A| is equal 
to the number of solutions that it dominates: 

\[
S(i) = \left| \{\, j \mid j \in P + A \wedge i \succ j \,\} \right| \tag{24.7}
\]

where 
|·| is the cardinality of a set 
“+” is the multiset union 
“≻” is the Pareto dominance relation 

FIGURE 24.12 Flowchart of SPEA2. 


The raw fitness value of each individual i is acquired by summing the strength values of its dominators 
in both population and archive, and is calculated as 

\[
R(i) = \sum_{j \in P + A,\; j \succ i} S(j) \tag{24.8}
\]

For a minimization problem, an individual with zero raw fitness value is a non-dominated individual, 
while an individual with a high raw fitness value is dominated by many individuals. 
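
Equations 24.7 and 24.8 can be sketched as below, assuming minimization and a single list that already holds the objective vectors of the population and the archive together. The function names are illustrative.

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (minimization assumed)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def spea2_raw_fitness(objectives):
    """Return (strength, raw fitness) for population and archive combined."""
    n = len(objectives)
    # S(i): number of individuals that i dominates (Equation 24.7).
    strength = [
        sum(dominates(objectives[i], objectives[j]) for j in range(n))
        for i in range(n)
    ]
    # R(i): sum of the strengths of all dominators of i (Equation 24.8).
    raw = [
        sum(strength[j] for j in range(n) if dominates(objectives[j], objectives[i]))
        for i in range(n)
    ]
    return strength, raw


combined = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
print(spea2_raw_fitness(combined))
```

Non-dominated individuals obtain a raw fitness of zero, and the raw fitness grows with both the number and the strength of an individual's dominators.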


24.2.4.2 Density Estimation 


Even though the raw fitness assignment offers a sort of niching mechanism on the basis of Pareto 
dominance, it may fail when most individuals do not dominate each other. As a result, density 
estimation is employed to discriminate between individuals that have identical raw fitness 
values. 

The density estimation technique used in the SPEA2 is the kth nearest neighbor method [BS86], 
where the density at any point is a decreasing function of the distance to the kth nearest data point. The 
density value of each individual i is given by 

\[
D(i) = \frac{1}{d_i^k + 2} \tag{24.9}
\]

where d_i^k is the desired distance value. The value 2 is a bias added to the denominator to ensure that 
the denominator is indeed larger than zero. Thus, it is guaranteed that the density value is smaller than 1. 
The steps of finding the value of d_i^k are as follows: 


1. For each individual i, the distances in objective space to all individuals j in archive and population 
are calculated and stored in a list. 

2. Sort the list in increasing order. 

3. The kth element gives the value of d_i^k, where k = \sqrt{|P| + |A|}. 
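
The three steps and Equation 24.9 can be sketched as follows. The sketch assumes Euclidean distances in objective space, leaves an individual's zero distance to itself out of the list, and rounds k down to an integer; these choices are assumptions of the sketch.

```python
import math


def spea2_density(objectives):
    """Density D(i) = 1 / (d_i^k + 2) for population and archive combined."""
    n = len(objectives)
    if n < 2:
        raise ValueError("at least two individuals are required")
    k = int(math.sqrt(n))                 # k = sqrt(|P| + |A|), rounded down
    density = []
    for i, a in enumerate(objectives):
        # Steps 1 and 2: distances to all other individuals, in increasing order.
        dists = sorted(math.dist(a, b) for j, b in enumerate(objectives) if j != i)
        d_k = dists[k - 1]                # Step 3: the kth element
        density.append(1.0 / (d_k + 2.0))
    return density


print(spea2_density([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]))
```

In SPEA2 the density is added to the raw fitness, so that among individuals with the same raw fitness the one in the less crowded region is favored.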


24.2.4.3 Archive Truncation Method 


In each generation, it is necessary to update the archive by copying all non-dominated individuals 
(i.e., those having a fitness lower than one) from archive and population to the archive of the next 
generation. If the number of all those non-dominated individuals exceeds the archive size |A|, the 
archive truncation method is used to iteratively remove some non-dominated individuals until 
the number reaches |A|. The merit of this archive truncation method is that the removal of boundary 
non-dominated solutions can be avoided. 

In each iteration, the individual i selected for removal is the one for which i ≤_d j for all j ∈ A_{t+1}, with 

\[
i \le_d j \;:\Leftrightarrow\; \forall\, 0 < k < |A_{t+1}| : \sigma_i^k = \sigma_j^k
\;\vee\;
\exists\, 0 < k < |A_{t+1}| : \left[ \left( \forall\, 0 < l < k : \sigma_i^l = \sigma_j^l \right) \wedge \sigma_i^k < \sigma_j^k \right]
\]

where 
σ_i^k is the distance of i to its kth nearest neighbor in A_{t+1} 
t is the counter of generation 

That is, the individual having the minimum distance to another individual is chosen at each stage. 
If there are several individuals with the minimum distance, the tie is broken by considering the second 
smallest distances and so forth. 
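
The truncation rule can be sketched by comparing, for each archive member, the sorted list of its distances to the other members. The sketch assumes Euclidean distances in objective space and relies on Python's lexicographic comparison of lists; the function names are illustrative.

```python
import math


def sorted_distances(archive, i):
    """Distances from member i to all other members, in increasing order."""
    return sorted(
        math.dist(archive[i], archive[j]) for j in range(len(archive)) if j != i
    )


def truncate_archive(archive, size):
    """Remove members one at a time until the archive holds `size` members."""
    archive = list(archive)
    while len(archive) > size:
        # Lists compare lexicographically: the nearest-neighbor distance first,
        # then the second nearest on ties, and so forth.
        victim = min(range(len(archive)), key=lambda i: sorted_distances(archive, i))
        del archive[victim]
    return archive


archive = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0)]
print(truncate_archive(archive, 3))
```

In this example the two closest members tie on their nearest-neighbor distance, and the tie is resolved by the second smallest distance, so the boundary solutions remain in the archive.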


24.2.5 Pareto Archived Evolution Strategy 


PAES is a local-search-based algorithm [KC09,KC00]. It imitates an algorithm called Evolution Strategy, 
and its flowchart is shown in Figure 24.13. In the PAES, the only genetic operation, mutation, is the 
main hill-climbing strategy, and an archive of limited size is maintained to store the previously 
found non-dominated solutions. 

The PAES has three versions: [(1 + 1)-PAES], [(1 + λ)-PAES], and [(μ + λ)-PAES]. In the 
first, a single parent generates a single offspring. In the second, a single 
parent produces λ offspring. In the last, a population of μ parents generates λ 
offspring. By comparison, [(1 + 1)-PAES] has the lowest computational overhead (i.e., it is the fastest 
algorithm) and is also the simplest yet the most reliable performer. 

The unique characteristic of the PAES is its adaptive grid scheme. It is a crowding procedure that 
recursively divides up the d-dimensional objective space in order to track the 
crowding degrees of various regions of the archive of non-dominated solutions. This ensures 
diversified non-dominated solutions. It also helps remove excess non-dominated solutions located in the 
crowded grids (i.e., the grids with a high crowding degree) if the number of those solutions exceeds the 
archive size. 

The adaptive grid scheme works as follows. When each solution is generated, it is necessary to determine 
its grid location in the objective space. Provided that the range of the space is defined in each objective, 
the required grid location can be found by repeatedly bisecting the range in each objective and 
determining in which half the solution lies. The location of the solution is marked with a binary string with a length 
2" where b is the number of bisections of the space for each objective and n is the number of objectives. 
FIGURE 24.13 Flowchart of PAES. 

If the solution is located in the larger half of the bisection of the space, the corresponding bit in the 
binary string is set. In order to record how many and which non-dominated solutions reside 
in each grid, a map of the grid is maintained throughout the run. Grid locations are 
recalculated only when the range of the objective space of the archived solutions changes by a threshold 
amount, in order to avoid recalculating the ranges too frequently. The scheme requires only one parameter: 
the number of divisions of the space. 
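
The bisection idea can be illustrated with the sketch below. It assumes that the lower and upper bounds of each objective are known, and it emits one bit per bisection per objective, concatenated objective by objective; this encoding and the function name are assumptions of the sketch rather than details taken from [KC00].

```python
from collections import Counter


def grid_location(solution, lower, upper, bisections):
    """Locate a solution by repeatedly bisecting the range of each objective."""
    bits = []
    for value, lo, hi in zip(solution, lower, upper):
        for _ in range(bisections):
            mid = (lo + hi) / 2.0
            if value >= mid:        # larger half: the bit is set
                bits.append("1")
                lo = mid
            else:                   # smaller half: the bit is cleared
                bits.append("0")
                hi = mid
    return "".join(bits)


archive = [(0.7, 0.2), (0.72, 0.21), (0.1, 0.9)]
locations = [grid_location(s, (0.0, 0.0), (1.0, 1.0), bisections=2) for s in archive]
grid_map = Counter(locations)       # number of archived solutions per grid
print(locations, grid_map)
```

The counter plays the role of the grid map: the grid with the highest count is the most crowded one, from which a solution would be removed when the archive is full.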


24.2.6 Micro Genetic Algorithm 


Micro genetic algorithm (MICROGA) was suggested by Coello Coello and Toscano Pulido 
[CTP01,CP01]. The flowchart of the MICROGA is shown in Figure 24.14 and it has three peculiarities: 
(1) population memory, (2) adaptive grid algorithm, and (3) three types of elitism. 


24.2.6.1 Population Memory 


The population memory is divided into two parts: a replaceable part and a non-replaceable part. The 
replaceable part undergoes some changes after each cycle of the MICROGA. In contrast, the non-replaceable 
part is never changed during the run and aims to provide the required diversity for the algorithm. 

At the beginning of each cycle, the population is taken from both portions of the population memory 
so as to have a mixture of randomly generated individuals (non-replaceable) and evolved individuals 
(replaceable). 
FIGURE 24.14 Flowchart of MICROGA. 

24.2.6.2 Adaptive Grid Algorithm 


The adaptive grid algorithm is similar to that of PAES and maintains diversity among the non-dominated
solutions. Once the archive storing the non-dominated solutions reaches its limit, the objective space
covered by the archive is divided into a number of grids. Then, each solution in the archive is assigned
a set of coordinates.

When a new non-dominated solution is generated, it is accepted if it is located in a grid where the
number of stored non-dominated individuals is smaller than that of the most crowded grid, or if it is
located outside the previously specified boundaries. Note that this adaptive grid algorithm requires two
parameters: (1) the expected size of the Pareto front, and (2) the number of positions into which the
solution space will be divided for each objective.
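The acceptance rule for a full archive can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `accept_into_full_archive` is hypothetical, grid cells are represented as tuples of coordinates, and the caller is assumed to have already determined whether the new solution falls inside the current grid boundaries.

```python
from collections import Counter


def accept_into_full_archive(cell, cell_counts, inside_bounds):
    """Decide whether a new non-dominated solution enters a full archive.

    cell          -- grid coordinates of the new solution (a tuple)
    cell_counts   -- Counter mapping grid coordinates to number of stored solutions
    inside_bounds -- False if the solution lies outside the current grid boundaries
    """
    if not inside_bounds:
        return True                                   # extends the covered objective space
    most_crowded = max(cell_counts.values(), default=0)
    return cell_counts.get(cell, 0) < most_crowded    # less crowded than the worst cell


if __name__ == "__main__":
    counts = Counter({(0, 0): 3, (0, 1): 1})
    print(accept_into_full_archive((0, 1), counts, True))   # sparsely populated cell
    print(accept_into_full_archive((0, 0), counts, True))   # most crowded cell
```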


24.2.6.3 Types of Elitisms 


In the first type of elitism, the non-dominated solutions found within the internal cycle of the MICROGA
are stored so that the valuable information obtained from the evolutionary process is not lost.

In the second type of elitism, the nominal solutions (i.e., the best solutions found when nominal
convergence is reached) are added to the replaceable part of the population memory. This speeds up
convergence, because crossover and mutation applied to these solutions have a higher probability of
reaching the true Pareto front of the problem over time.

In the last type, a certain number of solutions are picked uniformly from all the regions of the Pareto
front generated so far and are placed in the replaceable part of the population memory. The purpose of
this type is to use the best solutions generated so far as the starting point for further improvement
(either by getting closer to the true Pareto front or by getting a better distribution).
24.2.7 Jumping Genes


The jumping genes (JG) approach is one of the most recent evolutionary algorithms for solving MO
problems [TMC06,TMKO08]. JG comprises only simple operations in the evolutionary process, yet it
greatly alleviates the usual difficulties of achieving both convergence and diversity when searching for
appropriate solutions. By mimicking the JG phenomenon, the algorithm can explore and exploit the
solution space so that the solutions obtained are both accurate and widely spread along the
Pareto-optimal front.

The very first discovery of JG was reported by the Nobel laureate Barbara McClintock, based on
her work on corn plants [FB92]. To emulate this phenomenon in JG computation, a transposition of gene(s)
within the same chromosome or even to other chromosomes was devised. This operation can further
enhance genetic operations such as crossover, mutation, and selection, improving the fitness
quality of chromosomes from generation to generation.

JG comprise autonomous and nonautonomous transposable elements, called Activator (Ac)
and Dissociation (Ds) elements, respectively. These mobile genetic elements transpose (jump) from one
position to another within the same chromosome or even to another chromosome. The difference
between Ac and Ds is that Ac can transpose by itself, while Ds can only transpose when Ac is
activated.

Further, experimental observation also revealed that these jumping genes could move around the 
genome in two different ways. The first one is called cut-and-paste, which means a piece of DNA is cut 
and pasted somewhere else. The second one is known as copy-and-paste, meaning that the genes remain 
at the same location, while the message in the DNA is copied into RNA and then copied back into DNA 
at another place in the genome. 

The classical gene transmission from generation to generation is termed “vertical transmission”
(VT), i.e., from parent to children. The genes that can “jump” are considered “horizontal transmission”
(HT). The genes in HT can jump within the same individual (chromosome) or even from one
individual to another in the same generation. Through gene manipulation, HT can benefit
VT by supplying various natural building blocks. However, this process of placing a foreign set of genes
into a new location is not streamlined, nor can it be planned in advance, as natural selection tends to be
opportunistic, not foresighted. Therefore, most genetic takeovers, acquisitions, mergers, and fusions
usually ensue under conditions of environmental hardship, i.e., stress [MS02].

In MO, the purpose of the fitness functions is to examine the quality of the chromosome through the
evolutionary process. This serves as a means of inducing “stress” on the chromosome. The movement
of JG then has an exploratory effect on the chromosome, as its genes are horizontally disturbed
to create the necessary building blocks to form a new individual (solution).

Motivated by these biogenetic observations, computational JG operations have been recently proposed 
for the enhancement of the searching ability in genetic algorithms, in particular for MO problems. In order 
to emulate the jumping behavior (transposition process), the following issues are assumed in the design 
and implementation of the JG: 


1. Each chromosome has some consecutive genes, which are randomly selected as the transposon. 
There may be more than one transposon with different lengths (e.g., 1 or 2 bit for binary code). 

2. The jumped locations of the transposons are randomly assigned, and the operations can be car- 
ried on the same chromosome or to another chromosome in the population pool. 

3. Two JG operations, namely cut-and-paste and copy-and-paste, as depicted in Figures 24.15 and
24.16, respectively, were devised. In the former operation, the element is cut from its original
position and pasted into a new position of a chromosome. In the latter operation, the element
replicates itself, and the copy is inserted into a new location of the chromosome while the original
remains unchanged.

4. The JG operations are not limited to binary encoded chromosomes, but can also be extended for 
other coding methods, such as integer or real number [RKMO07]. 
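The two operations on a single chromosome can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, chromosomes are plain Python lists, and for copy-and-paste it is assumed that the chromosome length is fixed, so the copy overwrites the genes at the target location and the target must not overlap the original transposon.

```python
import random


def cut_and_paste(chrom, start, length, insert_at):
    """Cut the transposon chrom[start:start+length] and paste it at index
    insert_at of the remaining genes (same chromosome)."""
    transposon = chrom[start:start + length]
    rest = chrom[:start] + chrom[start + length:]
    return rest[:insert_at] + transposon + rest[insert_at:]


def copy_and_paste(chrom, start, length, paste_at):
    """Copy the transposon and overwrite `length` genes starting at paste_at.
    Assumption: fixed-length chromosome and a non-overlapping target."""
    if paste_at < 0 or paste_at + length > len(chrom):
        raise ValueError("target location falls outside the chromosome")
    if paste_at < start + length and start < paste_at + length:
        raise ValueError("target location overlaps the original transposon")
    out = list(chrom)
    out[paste_at:paste_at + length] = chrom[start:start + length]
    return out


def random_cut_and_paste(chrom, length, rng):
    """Pick the transposon position and the new insertion position at random."""
    start = rng.randrange(len(chrom) - length + 1)
    insert_at = rng.randrange(len(chrom) - length + 1)
    return cut_and_paste(chrom, start, length, insert_at)


if __name__ == "__main__":
    genes = list("abcdef")
    print("".join(cut_and_paste(genes, 1, 2, 3)))
    print("".join(copy_and_paste(genes, 0, 2, 4)))
    print("".join(random_cut_and_paste(genes, 2, random.Random(1))))
```

Cut-and-paste only rearranges genes, so the multiset of genes is preserved; copy-and-paste duplicates the transposon at the expense of the overwritten genes.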




FIGURE 24.15 Cut-and-paste operation on (a) two different chromosomes and (b) the same chromosome.


Similar to the crossover and mutation operations, these two jumping gene operations can be integrated
into any general framework of evolutionary algorithms. However, it has been found that combining
JG with the commonly used sorting scheme [DAM00,DAM02] gives better performance. A flowchart of JG
transposition is shown in Figure 24.17, whereas the flowcharts of the detailed
operations of cut-and-paste and copy-and-paste are given in Figure 24.18.

To enhance the search, these two operations are inserted after the parent selection process. The
entire flow diagram is given in Figure 24.19, in which the shaded part is added to the normal flow of a
classical GA (for the design of the GA, see [MTK96,TMH96] and the references cited therein). This
common computational flow is now generally known as JGEA.


24.2.8 Particle Swarm Optimization 


PSO is an evolutionary computation method that belongs to the class of model-based search techniques.
It was inspired by the social behavior of a flock of birds searching for food. Each bird (particle)
adjusts its search direction for food according to three factors: (1) its own velocity $v_{ij}(k)$,
(2) its own best previous experience ($pBest_{ij}(k)$), and (3) the best experience of the whole flock
($gBest_j(k)$).

FIGURE 24.16 Copy-and-paste operation on (a) two different chromosomes and (b) the same chromosome.

FIGURE 24.17 Flowchart of JG transposition.

FIGURE 24.18 Flowchart of JG (a) cut-and-paste and (b) copy-and-paste operations.

FIGURE 24.19 Genetic cycle of JGEA.

Coello et al. [CPL04] first extended PSO to multiobjective problems. The historical records of the
best solutions found by the particle(s) (pBest and gBest) are used to store the non-dominated solutions
generated in the past. The particle flies through the problem space with a velocity, which is constantly
updated by the particle’s own experience and the best experience of its neighbors in order to locate the
optimum point iteration by iteration. In each iteration, the velocity and position of each particle are
updated as follows:


Velocities:

$$v_{ij}(k+1) = \omega v_{ij}(k) + c_1 r_{1j}(k)\,[pBest_{ij}(k) - pos_{ij}(k)] + c_2 r_{2j}(k)\,[gBest_j(k) - pos_{ij}(k)], \quad i = 1, 2, \ldots, pop;\ j = 1, \ldots, n \qquad (24.10)$$


FIGURE 24.20 Flowchart of PSO.
Positions:

$$pos_{ij}(k+1) = pos_{ij}(k) + v_{ij}(k+1), \quad i = 1, 2, \ldots, pop;\ j = 1, \ldots, n \qquad (24.11)$$

where
$\omega$ is the inertia weight
$c_1$ is the cognition factor
$c_2$ is the social-learning factor
$r_{1j}(k), r_{2j}(k) \sim U(0, 1)$ are uniformly distributed random numbers
pop is the population size
$v_i(k)$ and $pos_i(k)$ are the velocity and position vectors
k is the time step


The appropriate values of $\omega$, $c_1$, and $c_2$ were studied in [VE06]. The nominal values for
most general applications are $\omega = 0.4$ and $c_1 = c_2 = 1.5$. The flowchart of multiobjective PSO
(MOPSO) is shown in Figure 24.20, and the PSO pseudocode is listed in Figure 24.21. In general, PSO
converges quickly but tends to suffer in diversity, as the obtained solutions tend to cluster in a small
region.
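One iteration of the velocity and position updates of Equations 24.10 and 24.11 can be sketched as follows. This is a hedged illustration: the function name `pso_step` is hypothetical, the cognitive and social terms are assumed to be scaled by the products of the factors c1, c2 and fresh uniform random numbers, and the default parameter values are the nominal ones quoted above.

```python
import random


def pso_step(pos, vel, p_best, g_best, w=0.4, c1=1.5, c2=1.5, rng=random):
    """Apply one velocity and position update to every particle.

    pos, vel, p_best -- lists of per-particle vectors (lists of floats)
    g_best           -- a single vector shared by the whole swarm
    """
    new_pos, new_vel = [], []
    for x_i, v_i, pb_i in zip(pos, vel, p_best):
        v_next = []
        for j in range(len(x_i)):
            r1, r2 = rng.random(), rng.random()        # r1, r2 ~ U(0, 1)
            v_next.append(w * v_i[j]
                          + c1 * r1 * (pb_i[j] - x_i[j])      # pull toward personal best
                          + c2 * r2 * (g_best[j] - x_i[j]))   # pull toward global best
        new_vel.append(v_next)
        new_pos.append([x + v for x, v in zip(x_i, v_next)])  # position update
    return new_pos, new_vel


if __name__ == "__main__":
    positions = [[1.0, 2.0], [0.0, 0.0]]
    velocities = [[0.5, -0.5], [0.1, 0.1]]
    new_p, new_v = pso_step(positions, velocities, positions, [1.0, 2.0],
                            rng=random.Random(0))
    print(new_p, new_v)
```

When a particle already sits on both its personal best and the global best, the two attraction terms vanish and only the inertia term remains, which is a convenient sanity check.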


Particle Swarm Optimization
[popParticle, MAXGEN, w, c1, c2] = initialpara();    // set initial parameters of PSO
Pareto = []; movePareto = []; Fit = [];
// Note: n is the dimension of the solution space
position = initialPop(popParticle, n);               // generate initial positions of particles
velocity = initialV(n);                              // initialize velocities of particles
for a = 1:MAXGEN
    for b = 1:popParticle
        ObjVal = sim(position(b));                   // perform system simulation and compute
                                                     // the objectives of a particle
        Fit = [Fit; ObjVal];                         // save the fitness
    end
    popRank = Rank(Fit, move);                       // rank the solutions
    [pBest, gBest] = findBest(Fit, position);        // find and save personal best & global best
    Update(position, velocity);                      // update the velocities and the positions
                                                     // of the population
    Pareto = pareto(Pareto, popRank, Fit, position, popParticle);   // get the non-dominated solutions
end


FIGURE 24.21 Pseudocode for PSO. 
24.3 Concluding Remarks


A historical account of MO has been given in this chapter. Although, owing to the page limit, no
specific engineering design example is given here, the pros and cons of each recommended MO scheme
have been outlined. The computational procedures, flowcharts, and necessary equations of each scheme
have been described. The material should be adequate for experienced scientists and engineers to follow,
and for beginners in the field to absorb for practical use.

It is generally recognized that the application of MO to engineering designs is largely hindered by the
conflicting requirements of rate of convergence and diversity of the obtained non-dominated solutions
that lead to, or close to, the ultimate Pareto-optimal front. A thorough study [TMKO08] has investigated
this issue among the popular schemes. On the other hand, the PSO method [CPL04] has gained
considerable attention in recent years, with the claim that a quicker rate of convergence can be achieved.

Nonetheless, the long list of relevant references included here provides a decent platform for
further enhancing the transparency of MO techniques and offers quick and easy access to this
intriguing technology, which was once considered out of reach for the computational
optimization community.
The solution of optimization problems having two or more (often conflicting) criteria has become
relatively common in a wide variety of application areas. Such problems are called “multiobjective,” and
their solution has generated a considerable amount of research within operations research, particularly
in the last 35 years [52]. In spite of the large number of mathematical programming methods available for
solving multiobjective optimization problems, such methods tend to have rather limited applicability
(e.g., they may apply only to differentiable objective functions, or to convex Pareto fronts). This has
motivated the use of alternative solution approaches such as evolutionary algorithms.

The use of evolutionary algorithms for solving multiobjective optimization problems was originally
hinted at in the late 1960s [65], but the first actual implementation of a multiobjective evolutionary
algorithm (MOEA) was not produced until 1985 [68]. Since then, this area, which is now called
“evolutionary multiobjective optimization,” or EMO, has experienced significant growth,
mainly in the last 15 years [8,15].

This chapter presents a basic introduction to EMO, focusing on its main concepts, the most popular
algorithms in current use, and some of its applications. The remainder of this chapter is organized as
follows. In Section 25.1, we provide some basic concepts from multiobjective optimization. The use of
evolutionary algorithms in multiobjective optimization is motivated in Section 25.2. The most
representative MOEAs in current use are described in Section 25.3. A set of sample applications of
MOEAs is provided in Section 25.4. Some of the main topics of research in the EMO field that currently
attract a lot of attention are briefly discussed in Section 25.5. Finally, some conclusions are provided
in Section 25.6.


* The author is also associated to the UMI-LAFMIA 3175 CNRS. 
25.1 Basic Concepts


We are interested in the solution of multiobjective optimization problems (MOPs) of the form

$$\text{minimize } [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x})] \qquad (25.1)$$

subject to the m inequality constraints:

$$g_i(\mathbf{x}) \le 0, \quad i = 1, 2, \ldots, m \qquad (25.2)$$

and the p equality constraints:

$$h_i(\mathbf{x}) = 0, \quad i = 1, 2, \ldots, p \qquad (25.3)$$

where k is the number of objective functions $f_i : \mathbb{R}^n \to \mathbb{R}$. We call
$\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ the vector of decision variables. We wish to determine from the
set F of all vectors which satisfy (25.2) and (25.3) the particular set of values
$x_1^*, x_2^*, \ldots, x_n^*$ which yield the optimum values of all the objective functions.


25.1.1 Pareto Optimality 


It is rarely the case that there is a single point that simultaneously optimizes all the objective functions.*
Therefore, we normally look for “trade-offs,” rather than single solutions, when dealing with
multiobjective optimization problems. The notion of “optimality” normally adopted in this case is the one
originally proposed by Francis Ysidro Edgeworth [20] and later generalized by Vilfredo Pareto [57]. Although
some authors call this notion Edgeworth-Pareto optimality, we use the most commonly adopted term
Pareto optimality.

We say that a vector of decision variables $\mathbf{x}^* \in F$ is Pareto optimal if there does not exist
another $\mathbf{x} \in F$ such that $f_i(\mathbf{x}) \le f_i(\mathbf{x}^*)$ for all $i = 1, \ldots, k$ and
$f_j(\mathbf{x}) < f_j(\mathbf{x}^*)$ for at least one j (assuming minimization).

In words, this definition says that $\mathbf{x}^*$ is Pareto optimal if there exists no feasible vector of
decision variables $\mathbf{x} \in F$ which would decrease some criterion without causing a simultaneous
increase in at least one other criterion. It is worth noting that the use of this concept normally produces
a set of solutions called the Pareto optimal set. The vectors $\mathbf{x}^*$ corresponding to the solutions
included in the Pareto optimal set are called nondominated. The image of the Pareto optimal set under
the objective functions is called the Pareto front.
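The definition translates directly into a dominance test on objective vectors. The following is a minimal sketch assuming minimization of all objectives; the function names `dominates` and `nondominated` are illustrative, and the quadratic-time filter is chosen for clarity rather than efficiency.

```python
def dominates(fa, fb):
    """True if objective vector fa Pareto-dominates fb (minimization):
    fa is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def nondominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    objective_vectors = [(1, 5), (2, 3), (3, 4), (4, 1)]
    print(nondominated(objective_vectors))
```

Here (3, 4) is removed because (2, 3) is at least as good in both objectives and strictly better in one; the remaining vectors are mutually incomparable trade-offs.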


25.2 Use of Evolutionary Algorithms 


The idea of using techniques based on the emulation of the mechanism of natural selection 
(described in Darwin’s evolutionary theory) to solve problems can be traced back to the early 1930s 
[25]. However, it was not until the 1960s that the three main techniques based on this notion were 
developed: genetic algorithms [35], evolution strategies [70], and evolutionary programming [26]. 


* In fact, this situation only arises when there is no conflict among the objectives, which would make unnecessary the 
development of special solution methods, since this single solution could be reached after the sequential optimization of 
all the objectives, considered separately. 
These approaches, which are now collectively called “evolutionary algorithms,” have been
very effective for single-objective optimization [27,30,71].

The basic operation of an evolutionary algorithm (EA) is the following. First, it generates a set of
possible solutions (called a “population”) to the problem at hand. Such a population is normally
generated in a random manner. Each solution in the population (called an “individual”) encodes all the
decision variables of the problem. In order to assess the suitability of each individual, a fitness function
must be defined. Such a fitness function is a variation of the objective function of the problem that we
wish to solve. Then, a selection mechanism must be applied in order to decide which individuals will
“mate.” This selection process is normally based on the fitness contribution of each individual (i.e., the
fittest individuals have a higher probability of being selected). Upon mating, a set of “offspring” are
generated. Such offspring are “mutated” (this operator produces a small random change, with a low
probability, on the contents of an individual), and constitute the population to be evaluated at the
following iteration (called a “generation”). This process is repeated until reaching a stopping condition
(normally, a maximum number of generations).
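The cycle just described can be sketched for a single-objective problem as follows. The sketch makes several assumptions that are not prescribed by the text: individuals are binary strings, selection is a binary tournament, mating is one-point crossover, and mutation flips each bit with a small probability. The function name `evolve` and its parameters are hypothetical.

```python
import random


def evolve(fitness, n_genes=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Run a basic generational EA that maximizes `fitness` over bit strings."""
    rng = random.Random(seed)
    # Random initial population; each individual encodes all decision variables.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]

        def select():
            # Binary tournament: the fitter of two random individuals is chosen.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_genes)              # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with low probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            offspring.append(child)
        pop = offspring                                  # next generation
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = evolve(sum)          # "OneMax": fitness is the number of ones
    print(best, sum(best))
```

The stopping condition is simply the maximum number of generations, as in the description above.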

EAs are considered a good choice for solving multiobjective optimization problems because they 
adopt a population of solutions, which allows them (if properly manipulated) to find several elements of 
the Pareto optimal set in a single run. This contrasts with mathematical programming methods, which 
normally generate a single nondominated solution per execution. Additionally, EAs tend to be less sus- 
ceptible to the discontinuity and the shape of the Pareto front, which is an important advantage over 
traditional mathematical programming methods [21]. 

Multiobjective evolutionary algorithms (MOEAs) extend a traditional evolutionary algorithm in two 
main aspects: 


• The selection mechanism. In this case, the aim is to select nondominated solutions, and to
consider all the nondominated solutions in a population as equally good.

• A diversity maintenance mechanism. This is necessary to avoid convergence to a single solution,
which is something that will eventually happen with an EA (because of stochastic noise) if run for
a sufficiently long time.


Regarding selection, although in their early days, several MOEAs relied on aggregating functions [34] 
and relatively simple population-based approaches [68], today, most of them adopt some form of Pareto 
ranking. This approach was originally proposed by David E. Goldberg [30], and it sorts the population of 
an EA based on Pareto dominance, such that all nondominated individuals are assigned the same rank 
(or importance). The aim is that all nondominated individuals get the same probability of being selected, 
and that such probability is higher than the one corresponding to individuals which are dominated. 
Although conceptually simple, this sort of selection mechanism allows for a wide variety of possible 
implementations [8,15]. 
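One possible implementation of Pareto ranking is the layered scheme in which the nondominated individuals receive rank 1, are set aside, and the procedure is repeated on the remainder. The sketch below assumes minimization; the function names are illustrative, and this simple version favors readability over speed.

```python
def dominates(fa, fb):
    """True if fa Pareto-dominates fb (minimization)."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def pareto_rank(objs):
    """Assign rank 1 to the nondominated vectors, rank 2 to those that become
    nondominated once rank 1 is removed, and so on."""
    remaining = set(range(len(objs)))
    ranks = [0] * len(objs)
    rank = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining)}
        for i in front:
            ranks[i] = rank          # all members of a front share the same rank
        remaining -= front
        rank += 1
    return ranks


if __name__ == "__main__":
    population = [(1, 5), (2, 3), (3, 4), (4, 1), (4, 5)]
    print(pareto_rank(population))
```

A selection scheme can then give every rank-1 individual the same selection probability, with lower probabilities for higher ranks.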

A number of methods have been proposed in the literature to maintain diversity in an EA. Such
approaches include fitness sharing and niching [16,32], clustering [78,84], geographically based
schemes [42], and the use of entropy [12,39], among others. Additionally, some researchers have
proposed the use of mating restriction schemes [72,84]. Furthermore, the use of relaxed forms of Pareto
dominance has also become relatively popular in recent years, mainly as an archiving technique
which encourages diversity, while allowing the archive to regulate convergence (see for example,
ε-dominance [45]).

A third component of modern MOEAs is elitism, which normally consists of using an external
archive (called a “secondary population”) that may or may not interact in different ways with the main
(or “primary”) population of the MOEA. The main purpose of this archive is to store all the
nondominated solutions generated throughout the search process, while removing those that become
dominated later in the search (called local nondominated solutions). The approximation of the Pareto
optimal set produced by a MOEA is thus the final contents of this archive.
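The archive update can be sketched as follows. This is an unbounded archive, assuming minimization; the function names are illustrative, and rejecting exact duplicates is a design choice of this sketch rather than something stated in the text.

```python
def dominates(fa, fb):
    """True if fa Pareto-dominates fb (minimization)."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def update_archive(archive, candidate):
    """Return the archive after offering it one candidate objective vector."""
    # Reject the candidate if an archived solution dominates or duplicates it.
    if any(dominates(a, candidate) or a == candidate for a in archive):
        return archive
    # Otherwise drop every archived solution the candidate dominates, then add it.
    return [a for a in archive if not dominates(candidate, a)] + [candidate]


if __name__ == "__main__":
    archive = []
    for point in [(2, 3), (3, 4), (1, 5), (1, 2)]:
        archive = update_archive(archive, point)
        print(point, "->", archive)
```

In the run above, (2, 3) and (1, 5) are stored while they are nondominated and are removed once (1, 2) arrives, which is exactly the fate of local nondominated solutions.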
25.3 Multiobjective Evolutionary Algorithms


Despite the considerable volume of literature on MOEAs that is currently available,* very few algo- 
rithms are used by a significant number of researchers around the world. The following are, from the 
author's perspective, the most representative MOEAs in current use: 


1. Strength Pareto Evolutionary Algorithm (SPEA): This MOEA was conceived as a merger of several
algorithms developed during the 1990s [84]. It adopts an external archive (called the external
nondominated set), which stores the nondominated solutions previously generated, and participates
in the selection process (together with the main population). For each individual in this
archive, a strength value is computed. This strength is proportional to the number of solutions
which a certain individual dominates. In SPEA, the fitness of each member of the current population
is computed according to the strengths of all external nondominated solutions that dominate
it. Since the external nondominated set can grow too much, this could reduce the selection pressure
and could slow down the search. In order to avoid this, SPEA adopts a technique that prunes
the contents of the external nondominated set so that its size remains below a certain (predefined)
threshold. For this purpose, the authors use a clustering technique.

2. Strength Pareto Evolutionary Algorithm 2 (SPEA2): This approach has three main differences
with respect to its predecessor [83]: (1) it incorporates a fine-grained fitness assignment strategy
which takes into account, for each individual, the number of individuals that it dominates and the
number of individuals by which it is dominated; (2) it uses a nearest-neighbor density estimation
technique which guides the search more efficiently; and (3) it has an enhanced archive truncation
method that guarantees the preservation of boundary solutions.

3. Pareto Archived Evolution Strategy (PAES): This is perhaps the simplest MOEA that one can
conceive, and was introduced by Knowles and Corne [44]. It consists of a (1 + 1) evolution strategy
(i.e., a single parent that generates a single offspring) in combination with a historical archive that
stores the nondominated solutions previously found. This archive is used as a reference set against
which each mutated individual is compared. Such an (external) archive adopts a crowding procedure
that divides objective function space in a recursive manner. Then, each solution is placed
in a certain grid location based on the values of its objectives (which are used as its “coordinates”
or “geographical location”). A map of such a grid is maintained, indicating the number of solutions
that reside in each grid location. When a new nondominated solution is ready to be stored in
the archive, but there is no room for it (the size of the external archive is bounded), a check is
made on the grid location to which the solution would belong. If this grid location is less densely
populated than the most densely populated grid location, then a solution (randomly chosen) from
this heavily populated grid location is deleted to allow the storage of the newcomer. This aims to
redistribute solutions, favoring the less densely populated regions of the Pareto front. Since the
procedure is adaptive, no extra parameters are required (except for the number of divisions of the
objective space).

4. Nondominated Sorting Genetic Algorithm II (NSGA-II): This is a heavily revised version of the
nondominated sorting genetic algorithm (NSGA), which was introduced in the mid-1990s [74].
The NSGA-II adopts a more efficient ranking procedure than its predecessor. Additionally, it
estimates the density of solutions surrounding a particular solution in the population by computing
the average distance of two points on either side of this point along each of the objectives of
the problem. This value is the so-called crowding distance. During selection, the NSGA-II uses
a crowded-comparison operator which takes into consideration both the nondomination rank
of an individual in the population and its crowding distance (i.e., nondominated solutions are
preferred over dominated solutions, but between two solutions with the same nondomination
rank, the one that resides in the less crowded region is preferred). Unlike most of the modern
MOEAs in current use, the NSGA-II does not use an external archive. Instead, the elitist mechanism
of the NSGA-II consists of combining the best parents with the best offspring obtained (i.e., a
(μ + λ)-selection). Due to its clever mechanisms, the NSGA-II is much more efficient
(computationally speaking) than its predecessor, and its performance is so good that it has become very
popular in the last few years, triggering a significant number of applications, and becoming some
sort of landmark against which new MOEAs have to be compared in order to merit publication.


* The author maintains the EMOO repository, which, as of December 2008, contains over 3600 bibliographic references
on evolutionary multiobjective optimization. The EMOO repository is available at: http://delta.cs.cinvestav.mx/~ccoello/EMOO/

5. Pareto Envelope-Based Selection Algorithm (PESA): This algorithm was proposed by Corne et al. [11],
and uses a small internal population and a larger external (or secondary) population. PESA adopts
the same adaptive grid from PAES to maintain diversity. However, its selection mechanism is
based on the crowding measure used by the aforementioned grid. This same crowding measure
is used to decide what solutions to introduce into the external population (i.e., the archive of
nondominated vectors found along the evolutionary process). Therefore, in PESA, the external
memory plays a crucial role in the algorithm since it determines not only the diversity scheme,
but also the selection performed by the method. There is also a revised version of this algorithm,
called PESA-II [10], which is identical to PESA, except for the fact that region-based selection is
used in this case. In region-based selection, the unit of selection is a hyperbox rather than an
individual. The procedure consists of selecting (using any of the traditional selection techniques [31])
a hyperbox and then randomly selecting an individual within such hyperbox. The main motivation
of this approach is to reduce the computational costs associated with traditional MOEAs
(i.e., those based on Pareto ranking).
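The crowding distance and the crowded-comparison operator described for the NSGA-II can be sketched as follows. The sketch follows the commonly used formulation in which, for each objective, the gap between the two neighbors of a solution is normalized by the objective range and summed over the objectives, and boundary solutions receive an infinite distance; these details, and the function names, are assumptions of the sketch rather than statements from the text.

```python
import math


def crowding_distance(front):
    """Crowding distance of every objective vector in one nondominated front."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        f_min, f_max = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf   # keep boundary solutions
        if f_max == f_min:
            continue                                  # degenerate objective
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (f_max - f_min)
    return dist


def crowded_better(rank_a, dist_a, rank_b, dist_b):
    """Crowded comparison: lower rank wins; ties go to the less crowded solution."""
    return rank_a < rank_b or (rank_a == rank_b and dist_a > dist_b)


if __name__ == "__main__":
    front = [(0.0, 4.0), (1.0, 2.0), (2.0, 1.0), (4.0, 0.0)]
    print(crowding_distance(front))
```

A larger distance indicates a less crowded neighborhood, so such solutions are preferred when two candidates share the same nondomination rank.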


Many other MOEAs have been proposed in the specialized literature (see for example [9,17,81]), but they will
not be discussed here due to obvious space limitations. A more interesting issue, however, is to predict which
sort of MOEA will become predominant in the next few years. Efficiency is, for example, a concern nowadays,
and several approaches have been developed in order to improve the efficiency of MOEAs (see for example
[37,41]). There is also an interesting trend consisting of designing MOEAs based on a performance measure
(see for example [3,82]). However, no clear trend exists today, from the author's perspective, that seems to
attract the interest of a significant portion of the EMO community, regarding algorithmic design.


25.4 Applications 


Today, there is a large number of applications of MOEAs in a wide variety of domains.
Next, we provide a brief list of sample applications classified in three large groups: engineering,
industrial, and scientific. Specific areas within each of these large groups are also identified.

By far, engineering applications are the most popular in the current EMO literature. This is not sur- 
prising if we consider that engineering disciplines normally have problems with better understood 
mathematical models, which facilitates the use of MOEAs. A representative sample of engineering 
applications is the following: 


• Electrical engineering [1,63]
• Hydraulic engineering [51,62]
• Structural engineering [56,58]
• Aeronautical engineering [38,47]
• Robotics [2,77]
• Control [5,6]
• Telecommunications [59,75]
• Civil engineering [22,36]
• Transport engineering [50,73]


Industrial applications are the second most popular in the EMO literature. A representative sample of
industrial applications of MOEAs is the following:


• Design and manufacture [19,33]
• Scheduling [29,40]
• Management [53,64]


Finally, there are several EMO papers devoted to scientific applications. For obvious reasons, computer 
science applications are the most popular in the EMO literature. A representative sample of scientific 
applications is the following: 


• Chemistry [60,76]
• Physics [61,67]
• Medicine [24,28]
• Computer science [14,48]


This sample of applications should give at least a rough idea of the increasing interest of researchers in
adopting MOEAs in practically all types of disciplines.


25.5 Current Challenges 


The existence of challenging but solvable problems is key to preserving interest in a research
discipline. Although EMO is a discipline in which a considerable amount of research has been
conducted, mainly within the last 10 years, several interesting problems still remain open. Additionally,
the research conducted so far has also led to new and intriguing topics. The following is a small sample of
open problems that currently attract a significant amount of research within EMO:


• Scalability: In spite of the popularity of MOEAs in a plethora of applications, it is known that
Pareto ranking is doomed to fail as we increase the number of objectives, and it is also known
that with about 10 objectives, it behaves like random sampling [43]. The reason is that most of the
individuals in a population become nondominated as the number of objectives increases.
In order to deal with this problem, researchers have proposed selection schemes different from
Pareto ranking [18,23], as well as mechanisms that reduce the number of objectives of a
problem [4,49]. However, there is still a lot of work to be done in this regard, and this is currently
a very active research area.

• Incorporation of user’s preferences: Normally, the user does not need the entire
Pareto front of a problem, but only a certain portion of it. For example, solutions lying at the
extreme parts of the Pareto front are unlikely to be needed, since they represent the best value for
one objective but the worst for the others. Thus, if the user has at least a rough idea of the sort of
trade-offs that he or she aims to find, it is desirable to be able to explore in more detail only the
nondominated solutions within the neighborhood of such trade-offs. This is possible if we use, for example,
biased versions of Pareto ranking [13] or some multi-criteria decision making technique from
the many developed in Operations Research [7]. Nevertheless, this area has not been very actively
pursued by EMO researchers, in spite of its usefulness.

• Parallelism: Although the use of parallel MOEAs is relatively common in certain disciplines such
as aeronautical engineering [55], the lack of serious research in this area is remarkable [8,79].
Thus, much more research on this topic is expected in the next few years, for example,
related to algorithmic design, the role of local search in parallel MOEAs, and convergence
analysis, among others.

• Theoretical foundations: Although an important effort has been made in recent years to develop
theoretical work related to MOEAs, in areas such as convergence [66,80], archiving [69], algorithm
complexity [54], and run-time analysis [46], a lot of work still remains to be done in this regard.
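
The scalability issue noted in the first item can be observed with a small experiment. The following Python sketch counts the fraction of nondominated individuals in a population as the number of objectives grows. The uniformly random objective vectors, the population size, and all function names are assumptions made for illustration; they are not taken from any particular MOEA.

```python
import random

def dominates(a, b):
    """Return True if vector a Pareto-dominates vector b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_fraction(points):
    """Fraction of points that no other point in the list dominates."""
    nondominated = 0
    for i, p in enumerate(points):
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i):
            nondominated += 1
    return nondominated / len(points)

def random_population(size, n_objectives, rng):
    """Hypothetical population: objective vectors drawn uniformly from [0, 1)."""
    return [tuple(rng.random() for _ in range(n_objectives)) for _ in range(size)]

# Print the nondominated fraction for an increasing number of objectives.
rng = random.Random(0)
for n_objectives in (2, 5, 10):
    population = random_population(100, n_objectives, rng)
    print(n_objectives, nondominated_fraction(population))
```

With few objectives only a small part of such a population is nondominated, whereas with many objectives almost every individual is, so a ranking based purely on Pareto dominance can no longer discriminate between individuals.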




25.6 Conclusions 


In this chapter, we have provided some basic concepts related to evolutionary multiobjective
optimization, as well as a short description of the main multiobjective evolutionary algorithms in current use.
The main application areas of such algorithms have also been included, in order to provide a better idea
of their wide applicability and of the increasing interest in using them.

In the last part of the chapter, we provided a short discussion of some challenging topics that are
currently very active within this research area. The main objective of this chapter is to serve as a general
(although brief) overview of the EMO field. Its main aim is to motivate researchers and newcomers from
different areas, who have to deal with multiobjective problems, to consider MOEAs as a viable choice.


26.1 Introduction


Many practical problems in logistics, planning, design, engineering, biology, and other fields can be 
modeled as optimization problems, where the goal is the minimization (respectively, maximization) of 
a particular objective function. The objective function assigns an objective cost value to each possible 
candidate solution. The domain of the objective function is called the search space, which may be either 
discrete or continuous. Optimization problems with discrete search space are also called combinatorial 
optimization problems (COPs). In principle, the purpose of an optimization algorithm is to find a 
solution that minimizes the objective function, that is, an optimal solution. 

When dealing with optimization problems, algorithms that guarantee to find an optimal solution
within bounded time are called complete algorithms. Nonetheless, some optimization problems may
be too large or complex for complete algorithms to solve. In particular, there exists a class of problems,
known as NP-hard, for which it is generally assumed that no complete algorithm with polynomial
running time will ever be found. However, many practical situations only require finding a solution that,
despite not being optimal, is “good enough,” especially if such a solution is found within a reasonable
computation time. This compromise explains the interest in the development of approximate
(incomplete) algorithms, which aim to find a solution with objective cost close to the optimal within a
computation time much shorter than that of any complete algorithm.

Ant colony optimization (ACO) [26] was one of the first techniques for approximate optimization
inspired by the collective behavior of social insects. From the perspective of operations research, ACO
algorithms belong to the class of metaheuristics [11,33,38]. At the same time, ACO algorithms are part of
a research field known as swarm intelligence [10]. ACO takes its inspiration from the foraging behavior
of ant colonies. At the core of this behavior is the indirect communication between the ants by means of
chemical pheromone trails, which enables them to find short paths between their nest and food sources.
This characteristic of real ant colonies is exploited in ACO algorithms for solving combinatorial and
continuous optimization problems. In this introductory chapter, we focus, for space reasons, on
combinatorial optimization, which is the traditional field of application of ACO. The interested reader may
find comprehensive information on ACO algorithms for continuous optimization in the work of Socha
and Dorigo [66].

The motivation of this chapter is to introduce the basic concepts of ACO without presuming any elabo- 
rate knowledge of computer science. Toward this goal, Section 26.2 briefly defines what a COP is. Section 
26.3 introduces the concepts of complete versus approximate optimization algorithms and the definition of 
metaheuristic. Section 26.4 covers the origin and fundamental concepts of the ACO metaheuristic. Section 
26.5 provides examples of modern ACO algorithms and their characteristics. Section 26.6 describes how 
ACO algorithms have been combined with other optimization techniques to further improve performance. 
Finally, Section 26.7 offers a representative sample of the wide range of applications of ACO algorithms. 


26.2 Combinatorial Optimization Problems 


In computer science, an optimization problem is composed of a search space $S$ defined over a finite set of
decision variables $x_i$ ($i = 1, \ldots, n$), a set of constraints $\Omega$ among these variables, and an objective function
$f: S \to \mathbb{R}^+$ that assigns a cost value to each element of $S$. Elements of the search space $S$ are called
candidate solutions to the problem. Candidate solutions that do not satisfy the constraints in $\Omega$ are called
infeasible. COPs are those where the domains of the decision variables are discrete, hence the search
space is finite. Nonetheless, the search space of a COP may be too large to be enumerated.

The definition of an optimization problem usually contains several parameters that are not fully 
specified, such as the number n of decision variables. An instance of a particular optimization problem 
is one possible specification of all parameters of an optimization problem, excluding the decision vari- 
ables themselves. Given an instance of an optimization problem, the goal is to find a feasible solution, 
which minimizes the objective function.* 

A notable example of a COP is the travelling salesman problem (TSP). In the classical definition of
the TSP, the goal is to find the shortest route that visits each of a given set of cities exactly once and returns
to the origin. This problem is generalized as the problem of finding the minimal Hamiltonian cycle
on a completely connected graph with weighted arcs. In this case, an instance of the TSP is
defined by a particular graph, typically given as a distance matrix between nodes. The difficulty of a TSP
instance greatly depends on the number of nodes in the graph. Despite its simplicity, the TSP represents
various real-world problems.
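
To make the notions of instance, candidate solution, and objective function concrete, the following Python sketch evaluates tours on a TSP instance given as a distance matrix. The 4-node matrix and the function name are hypothetical and serve only as an illustration.

```python
def tour_length(tour, dist):
    """Objective function f: total length of the closed tour.

    tour is a sequence of node indices (a candidate solution) and
    dist is the distance matrix that defines the instance.
    """
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

# Hypothetical symmetric instance with four cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(tour_length([0, 1, 2, 3], DIST))  # 2 + 6 + 3 + 10 = 21
print(tour_length([0, 1, 3, 2], DIST))  # 2 + 4 + 3 + 9 = 18
```

Both tours are feasible, since each city is visited exactly once, but they differ in cost; the optimization task is to find the feasible tour of minimal length.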

Given an instance of an optimization problem, the goal of an optimization algorithm is to find an 
optimal solution, that is, a feasible solution that minimizes the objective function. 


26.3 Optimization Algorithms 


Ideally, an optimization algorithm for a given problem should be able to find an optimal solution for any
problem instance. When this is the case, such an algorithm is called complete (or exact). A trivial example of a
complete algorithm is exhaustive search, where all possible solutions are iteratively enumerated and evaluated.

In some cases, for example, when the problem is trivial enough or the instance size is small,
complete algorithms are able to find the optimal solution in a reasonable amount of time. For many
interesting problems, however, the computation time required by complete algorithms is excessively
large. In particular, there is a class of COPs, called NP-hard, for which it is generally assumed that
no complete algorithm that requires a polynomial computation time will ever be found. The TSP
is an example of an NP-hard problem and, hence, every known complete algorithm requires, in the
worst case, exponential time with respect to the instance size.
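
As an illustration of a complete algorithm, the following Python sketch solves a TSP instance by exhaustive search. It fixes node 0 as the start and enumerates all (n − 1)! orderings of the remaining nodes, which is why the approach is only practical for very small instances. The function name and the 4-node distance matrix are hypothetical.

```python
from itertools import permutations

def exhaustive_tsp(dist):
    """Enumerate every tour starting at node 0 and return the shortest one."""
    n = len(dist)
    best_tour, best_length = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
        if length < best_length:  # keep the first tour of minimal length
            best_tour, best_length = tour, length
    return best_tour, best_length

# Hypothetical symmetric instance with four cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(exhaustive_tsp(DIST))
```

With 4 nodes only 6 tours are evaluated, but with 20 nodes the loop would already have to visit 19! (more than 10^17) tours.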


* The rest of the chapter will concern only minimization problems, without loss of generality, because maximizing over
an objective function f is the same as minimizing over −f.

Approximate (or heuristic) algorithms, on the other hand, do not guarantee returning the optimal
solution. Instead, approximate algorithms aim to find a good approximation to the optimal solution 
in a short time. Simple heuristics use some kind of rule of thumb specific to the problem at hand. In 
the case of the TSP, an often used rule of thumb concerns the selection of short arcs. By comparison, 
more sophisticated metaheuristics are approximate algorithms that can be adapted to different problems 
[11,33,38]. Examples of metaheuristics are simulated annealing, iterated local search, tabu search, evo- 
lutionary algorithms, and ACO. 
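
The rule of thumb of selecting short arcs can be sketched as the well-known nearest-neighbor heuristic, shown below in Python. It is a simple problem-specific heuristic, not a metaheuristic, and it offers no optimality guarantee. The function name, the tie-breaking rule, and the 4-node instance are assumptions made for illustration.

```python
def nearest_neighbor_tour(dist, start=0):
    """Build a tour by always moving to the closest unvisited node."""
    n = len(dist)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        # Ties are broken by the smaller node index (an arbitrary choice).
        nxt = min(unvisited, key=lambda j: (dist[current][j], j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Hypothetical symmetric instance with four cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(nearest_neighbor_tour(DIST))
```

The heuristic needs only on the order of n² distance lookups, in contrast to the factorial growth of exhaustive search, at the price of possibly returning a suboptimal tour.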


26.4 Ant Colony Optimization 


The first ACO algorithm, Ant System (AS) [22,25], was inspired by the observation that ant colonies are 
able to find short paths between food sources and their nests. Ants searching for food explore the environ- 
ment around their nest in a seemingly random wandering. As they move around, ants deposit a chemical 
substance, called pheromone. At the same time, ants are influenced by the pheromone present in their 
surroundings, having a probabilistic tendency to follow the direction where the concentration of phero- 
mone is stronger. An ant that finds a food source returns to the nest, laying on the ground on its way back 
an amount of pheromone that depends on the quality of the food source. In this manner, an ant is able to 
indicate to other ants from the same colony the location and quality of the food source without any direct 
communication. Experiments on real ants [20] have shown that this indirect coordination between ants 
via pheromone trails—a mechanism generally known as stigmergy—produces a self-organizing collective 
behavior, where shorter paths between their nest and food sources are progressively followed by more ants, 
reinforcing pheromone trails of successful routes and eventually finding the shortest path. 

ACO is a term that designates a number of algorithms based on this collective behavior of real ant
colonies. When moving from real ants to artificial ants, we must bear in mind that the goal of ACO
is not to faithfully model the natural world, but rather to find a good approximation to the unknown
optimal solution of a given COP. In the context of COPs, pheromone values are probabilities
associated with solution components. Solution components are elementary units that are assembled to form
complete solutions to the particular problem. The task of each artificial ant consists in selecting solution
components to construct a feasible candidate solution to the particular problem. Other notions from the
natural world, such as nest and food source, do not need an equivalent in ACO.

The basic elements of an ACO algorithm are graphically described by Figure 26.1. When applying
an ACO algorithm to a particular COP, the first step is to define a finite set $C = \{c_1, c_2, \ldots\}$ of solution
components, such that complete solutions may be constructed by iteratively selecting elements from this
set. Second, one has to define a set $T$ of pheromone values, which are numerical values associated with
solution components: $\exists c_i \in C, \forall \tau_i \in T$. In the example of the TSP, each edge of the given graph may be
considered a solution component, where $\tau_{ij}$ represents the pheromone information associated with the
edge connecting nodes $i$ and $j$. An alternative definition considers solution components as assignments
of nodes to absolute positions in the tour, and hence, pheromone information $\tau_{ij}$ would be associated
with visiting node $j$ as the $i$th node of the tour. The former approach has repeatedly been shown to produce
better results for the TSP, but the opposite is true for other problems. In fact, the study of different
pheromone representations is an ongoing research subject [57].


FIGURE 26.1 The fundamental components of the ACO framework. (Diagram not reproduced; recoverable
labels: combinatorial optimization problem, probabilistic solution construction.)


Algorithm 26.1 Ant Colony Optimization (ACO)

Initialization()
while termination criteria not met do
    for each ant a ∈ {1, ..., N^a} do
        AntBasedSolutionConstruction() /* see Algorithm 26.2 */
    end for
    PheromoneUpdate()
    DaemonActions() /* optional */
end while

Pheromone values (implicitly) define a probability distribution over the search space. Artificial
ants assemble complete candidate solutions by probabilistically choosing a sequence of solution
components. In addition, pheromone values are modified taking into account the quality of the candidate
solutions in order to bias future construction of solutions toward high-quality solutions. This cycle
of probabilistic solution construction and pheromone update constitutes the two fundamental steps
of algorithms based on the ACO metaheuristic. Hence, most ACO algorithms follow the algorithmic
schema shown in Algorithm 26.1, where three main procedures, AntBasedSolutionConstruction(),
PheromoneUpdate(), and DaemonActions(), are performed at each iteration of the algorithm.
The order of execution of these procedures is up to the algorithm designer. From the algorithmic
point of view, AntBasedSolutionConstruction() and PheromoneUpdate() are the two basic
operations that must be defined by any ACO algorithm, and we describe them in more detail in the
following sections. The optional procedure DaemonActions() denotes any additional operation that
is not related to solution construction or pheromone update, such as applying local search.
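
As a minimal end-to-end sketch of this schema, the following Python code applies the loop of Algorithm 26.1 to a small symmetric TSP instance. Several choices are assumptions made for illustration and are not prescribed by the text: the heuristic information 1/d_ij, the deposit Δτ(s) = 1/f(s), the use of an iteration-best update (one of the strategies described in Section 26.4.2), all parameter values, the 4-node distance matrix, and all function names. DaemonActions() is omitted.

```python
import random

def tour_length(tour, dist):
    """Objective function: length of the closed tour."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

def construct_tour(dist, tau, alpha, beta, rng):
    """AntBasedSolutionConstruction(): one ant builds one complete tour."""
    n = len(dist)
    tour = [rng.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        i = tour[-1]
        candidates = sorted(unvisited)
        # Weight of edge (i, j): pheromone^alpha * (1 / distance)^beta.
        weights = [
            (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
            for j in candidates
        ]
        j = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(j)
        unvisited.remove(j)
    return tour

def aco_tsp(dist, n_ants=10, n_iterations=50, rho=0.1,
            alpha=1.0, beta=2.0, seed=0):
    """Main loop: construct solutions, then update the pheromone values."""
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]          # Initialization()
    best_tour, best_length = None, float("inf")
    for _ in range(n_iterations):
        tours = [construct_tour(dist, tau, alpha, beta, rng)
                 for _ in range(n_ants)]
        ib_tour = min(tours, key=lambda t: tour_length(t, dist))
        ib_length = tour_length(ib_tour, dist)
        if ib_length < best_length:
            best_tour, best_length = ib_tour, ib_length
        # PheromoneUpdate(): evaporation on every edge ...
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        # ... and reinforcement of the edges of the iteration-best tour.
        deposit = rho * (1.0 / ib_length)
        for k in range(n):
            a, b = ib_tour[k], ib_tour[(k + 1) % n]
            tau[a][b] += deposit
            tau[b][a] += deposit
    return best_tour, best_length

# Hypothetical symmetric instance with four cities (all distances > 0).
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(aco_tsp(DIST))
```

On an instance this small the search space contains only three distinct tours, so the sketch merely demonstrates the control flow; the interplay between exploration and convergence only becomes relevant on larger instances.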


26.4.1 Solution Construction 


At every iteration of ACO, a number of artificial ants construct complete solutions following procedure
AntBasedSolutionConstruction() (Algorithm 26.2). Each artificial ant assembles a complete solution
by probabilistically choosing a sequence of elements from the set $C$ of solution components. Each ant
starts with an empty sequence $s = \langle\rangle$. At each construction step, the current sequence $s$ is extended by
adding a solution component from the set $N(s) \subseteq C$. The set $N(s)$ is defined such that the extension of the
partial solution $s$ may still result in a valid solution for the problem under consideration. For example,
in the case of the TSP, $N(s)$ is the set of nodes not visited yet in partial tour $s$. The choice of a solution
component $c_i \in N(s)$ is performed probabilistically with respect to the pheromone information $\tau_i \in T$.


Algorithm 26.2 Procedure AntBasedSolutionConstruction()

s ← ⟨⟩ /* start with an empty solution */
repeat
    determine N(s)
    c ← ChooseFrom(N(s))
    s ← s ∪ {c} /* extend s with solution component c */
until s is a complete solution


In most ACO algorithms, the transition probabilities are defined as follows:

$$p(c_i \mid s) = \frac{[\tau_i]^{\alpha} \cdot [\eta_i]^{\beta}}{\sum_{c_j \in N(s)} [\tau_j]^{\alpha} \cdot [\eta_j]^{\beta}}, \quad \forall c_i \in N(s) \qquad (26.1)$$

where $\eta$ is the optional heuristic information, which assigns a heuristic value $\eta_i$ to each solution
component $c_i \in N(s)$. The specification of the heuristic information depends on the problem. In
the simplest case, the heuristic information depends only on the solution components. More
sophisticated versions also consider the current partial solution. In general, its calculation should
be fast, typically a constant value or a computationally inexpensive function. Finally, the exponents
$\alpha$ and $\beta$ determine the relative influence of pheromone and heuristic information on the resulting
probabilities.
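
The transition rule can be sketched in a few lines of Python. The code below computes the probabilities of Equation 26.1 for a set of feasible components and draws one of them, which corresponds to the ChooseFrom step of Algorithm 26.2. The dictionary-based data layout, the function names, and the default values of the exponents are assumptions made for illustration.

```python
import random

def transition_probabilities(feasible, tau, eta, alpha=1.0, beta=2.0):
    """Probability of each feasible component following Equation 26.1.

    tau and eta map each component to its pheromone value and its
    heuristic value, respectively.
    """
    weights = {c: (tau[c] ** alpha) * (eta[c] ** beta) for c in feasible}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def choose_from(feasible, tau, eta, alpha=1.0, beta=2.0, rng=random):
    """Draw one feasible component according to the transition probabilities."""
    probs = transition_probabilities(feasible, tau, eta, alpha, beta)
    components = list(probs)
    return rng.choices(components, weights=[probs[c] for c in components], k=1)[0]

# Two hypothetical components with equal pheromone but different heuristic value.
tau = {"a": 1.0, "b": 1.0}
eta = {"a": 1.0, "b": 2.0}
print(transition_probabilities(["a", "b"], tau, eta, alpha=1.0, beta=1.0))
```

With β = 0 the heuristic information is ignored and only the pheromone values matter, whereas with α = 0 the construction reduces to a randomized greedy heuristic.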


26.4.2 Pheromone Update 


The pheromone information is modified in order to increase the probability of constructing better 
solutions in future iterations. Normally, this is achieved by performing both reinforcement—increas- 
ing the pheromone values associated with some solution components—and evaporation. Pheromone 
evaporation uniformly decreases all pheromone values, effectively reducing the influence of previous 
pheromone updates in future decisions. This has the effect of slowing down convergence. An algorithm 
has converged when subsequent iterations repeatedly construct the same solutions, and hence, further 
improvement is unlikely. Premature convergence prevents the adequate exploration of the search space 
and leads to poor solution quality. On the other hand, excessive exploration prevents the algorithm from 
quickly obtaining good solutions. 

Pheromone reinforcement is performed by, first, selecting one or more solutions from those already con- 
structed by the algorithm, and second, by increasing the values of the pheromone information associated 
with solution components that are part of these solutions. A general form of the pheromone update would be 


$$\tau_i \leftarrow (1 - \rho) \cdot \tau_i + \rho \cdot \sum_{\{s \in S_{upd} \mid c_i \in s\}} \Delta\tau(s), \quad \forall \tau_i \in T \qquad (26.2)$$


where
    $\rho \in (0, 1)$ is a parameter called evaporation rate
    $S_{upd}$ denotes the set of solutions that are selected for updating the pheromone information
    $\Delta\tau(s)$ is the pheromone amount deposited by each ant, which may be a function of the objective
    value of $s$


The only requirement is that $\Delta\tau(s)$ is nonincreasing with respect to the value of the objective function,
that is, $f(s) < f(s') \Rightarrow \Delta\tau(s) \geq \Delta\tau(s')$, $\forall s \neq s' \in S$. In the simplest case, $\Delta\tau(s)$ may be a constant value.
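
A direct transcription of Equation 26.2 into Python could look as follows. Pheromone values are stored in a dictionary keyed by solution component, and each solution is any collection of components; this data layout and the function name are assumptions made for illustration.

```python
def pheromone_update(tau, update_set, rho, delta_tau):
    """Apply Equation 26.2 and return the new pheromone values.

    tau        -- dict mapping each solution component to its pheromone value
    update_set -- the solutions selected for the update (S_upd)
    rho        -- evaporation rate, 0 < rho < 1
    delta_tau  -- function giving the amount deposited for a solution
    """
    new_tau = {}
    for component, value in tau.items():
        # Sum the deposits of all selected solutions containing the component.
        deposit = sum(delta_tau(s) for s in update_set if component in s)
        new_tau[component] = (1.0 - rho) * value + rho * deposit
    return new_tau

# Hypothetical example: one selected solution made of components "a" and "b".
tau = {"a": 1.0, "b": 1.0, "c": 1.0}
print(pheromone_update(tau, [("a", "b")], rho=0.1, delta_tau=lambda s: 2.0))
```

Components that belong to no selected solution only evaporate (here "c" drops from 1.0 to 0.9), while the others move toward the deposited amount.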

ACO algorithms often differ in the specification of $S_{upd}$. In most cases, $S_{upd}$ is composed of some of
the solutions generated in the respective iteration (denoted by $S_{iter}$) and the best solution found since
the start of the algorithm, the best-so-far solution, henceforth denoted by $s_{bf}$. For example, AS selects
for update all solutions constructed in the latest iteration, $S_{upd} = S_{iter}$. Other successful alternatives are
the iteration-best and best-so-far update strategies, respectively ib-update and bf-update. The ib-update
strategy utilizes only the best solution from the current iteration, that is, $S_{upd} = \{s_{ib}\}$, where
$s_{ib} = \arg\min \{f(s) \mid s \in S_{iter}\}$. The ib-update rule focuses the search on the best solutions found in recent
iterations. Similarly, the bf-update strategy utilizes only the best-so-far solution ($s_{bf}$), which produces an even
faster convergence. In practice, the most successful ACO variants use variations of ib-update or bf-update
rules and additionally include mechanisms to avoid premature convergence.


TABLE 26.1 Selection of Successful ACO Variants

ACO Variant                   Authors                            Main Reference
Elitist AS                    Dorigo                             [22]
                              Dorigo, Maniezzo, and Colorni      [25]
Rank-based AS                 Bullnheimer, Hartl, and Strauss    [16]
MAX-MIN Ant System (MMAS)     Stützle and Hoos                   [70]
Ant Colony System             Dorigo and Gambardella             [23]
Hyper-Cube Framework          Blum and Dorigo                    [9]


26.5 Modern ACO Algorithms 


Although AS, the first ACO algorithm, proved the feasibility of a discrete optimization algorithm based on
ants’ foraging behavior, its results were inferior to those of state-of-the-art algorithms for the TSP. Hence,
several improved variants have been proposed over time. Table 26.1 enumerates the most prevalent ACO
variants, and we give a brief summary of their characteristics.

The variants elitist ant system (EAS) [25] and rank-based ant system (RAS) [16] mainly differ in the
update method. EAS updates pheromone values using all solutions constructed in the current
iteration plus the best-so-far solution ($S_{upd} = S_{iter} \cup \{s_{bf}\}$), assigning a larger pheromone amount to
solution $s_{bf}$. RAS considers for update a limited number $m - 1$ of solutions from the current iteration
plus $s_{bf}$. These solutions are ranked according to their objective value, and the highest ranked solution
$s$ contributes an amount of pheromone of $m \cdot \Delta\tau(s)$, the one with the second-best rank $s'$ contributes
$(m - 1) \cdot \Delta\tau(s')$, and so on for all $m$ solutions.
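
The rank-based weighting described for RAS can be sketched as follows in Python. The function returns, for each selected solution, the weighted pheromone amount it contributes. The function name, the argument layout, and the handling of the case where the best-so-far solution also appears in the current iteration are assumptions made for illustration, not details taken from [16].

```python
def rank_based_deposits(iteration_solutions, best_so_far, f, m, delta_tau):
    """Return (solution, weighted deposit) pairs for a rank-based update.

    The m - 1 best solutions of the iteration plus the best-so-far solution
    are ranked by objective value f; rank r (starting at 0) contributes
    (m - r) * delta_tau(solution).
    """
    # Assumption: skip copies of the best-so-far solution to avoid duplicates.
    others = [s for s in iteration_solutions if s != best_so_far]
    ranked = sorted(others, key=f)[: m - 1]
    selected = sorted([best_so_far] + ranked, key=f)
    return [(s, (m - r) * delta_tau(s)) for r, s in enumerate(selected)]

# Hypothetical solutions identified by name, with their objective values.
costs = {"a": 10, "b": 12, "c": 15, "d": 20}
print(rank_based_deposits(["d", "b", "c"], "a", costs.get, m=3,
                          delta_tau=lambda s: 1.0))
```

In this example the solution "d" is not among the m − 1 = 2 best solutions of the iteration and therefore contributes nothing to the update.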

MAX-MIN Ant System (MMAS) [70] and Ant Colony System (ACS) [23], perhaps the most
successful ACO variants nowadays, introduced more sophisticated characteristics. In MMAS,
bounds on the pheromone values are dynamically calculated to avoid premature convergence and favor
exploration of new solutions. In addition, MMAS uses a combination of ib-update and bf-update rules
in order to focus the search on the (seemingly) most promising solution components.
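
The effect of pheromone bounds can be sketched with a simple clamping step applied after each pheromone update. How the bounds themselves are calculated is specific to MMAS and is not reproduced here; in this Python sketch they are plain arguments, and the function name and example values are hypothetical.

```python
def clamp_pheromones(tau, tau_min, tau_max):
    """Force every pheromone value into the interval [tau_min, tau_max]."""
    return {c: min(tau_max, max(tau_min, value)) for c, value in tau.items()}

# Hypothetical pheromone values after an update step.
tau = {"a": 0.001, "b": 0.5, "c": 7.0}
print(clamp_pheromones(tau, tau_min=0.01, tau_max=5.0))
```

Because no value can fall below the lower bound, every solution component keeps a nonzero probability of being selected, which counteracts premature convergence.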

ACS adds a greedy alternative to the probabilistic solution construction described by Equation 26.1.
When choosing the next solution component, an ant has a certain probability $q_0$ of choosing the solution
component that maximizes $[\tau_i]^{\alpha} \cdot [\eta_i]^{\beta}$; otherwise, the ant chooses probabilistically following Equation 26.1.
A higher value of $q_0$ focuses the search around the best solution components, accelerating convergence.
ACS follows a bf-update rule; however, evaporation is applied only to the pheromone values associated with
the best-so-far solution $s_{bf}$. Finally, ACS includes a local pheromone update that is performed after each
solution construction step and decreases the pheromone values of solution components already visited.
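
The greedy alternative used by ACS during solution construction can be sketched as follows in Python. The data layout (dictionaries keyed by component), the function name, and the default exponents are assumptions made for illustration; the local pheromone update of ACS is not included.

```python
import random

def acs_choose(feasible, tau, eta, q0, alpha=1.0, beta=2.0, rng=random):
    """Choose the next component with the greedy/probabilistic rule of ACS.

    With probability q0 the component with the largest value of
    tau^alpha * eta^beta is taken; otherwise a component is drawn with
    probability proportional to that value (Equation 26.1).
    """
    weights = {c: (tau[c] ** alpha) * (eta[c] ** beta) for c in feasible}
    if rng.random() < q0:
        return max(weights, key=weights.get)       # greedy choice
    components = list(weights)
    return rng.choices(components,
                       weights=[weights[c] for c in components], k=1)[0]

# Hypothetical components: "b" has the larger heuristic value.
tau = {"a": 1.0, "b": 1.0}
eta = {"a": 1.0, "b": 2.0}
print(acs_choose(["a", "b"], tau, eta, q0=1.0))  # always the greedy choice "b"
```

Setting q0 to 0 recovers the purely probabilistic construction, while values close to 1 make the construction almost deterministic.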

A different proposal is the hyper-cube framework (HCF) [9], which is a framework for implementing
ACO algorithms, including the ones mentioned above. The HCF has made it possible to obtain theoretical
results about the convergence of ACO. The practical benefits of the HCF are that pheromone values are
limited to the interval [0, 1] and that the pheromone update is not affected by the scale of the objective
function values.


26.6 Extensions of the ACO Metaheuristic 


The ACO metaheuristic described above has been frequently complemented with additional algorithmic fea- 
tures. In particular, the combination with local search has been a standard approach, where local search can be 
considered a step to refine the solutions constructed by ACO before using them for pheromone update. Another 
notable example is the candidate list strategy [29], where the number of available choices at each solution con- 
struction step is restricted to a set of best choices. The set of best choices is usually selected with respect to their 
transition probabilities (Equation 26.1). The rationale for this approach is that for the construction of high-qual- 
ity solutions, it is often enough to consider only the promising choices at each construction step. Moreover, for 
large instances with many solution components, limiting the number of choices greatly speeds up the search.

The ideas behind ACO have also been integrated into hybrid techniques. Hybrid metaheuristics [8]
combine different metaheuristics with other optimization ideas in order to complement the strengths
of different approaches. In the following, we enumerate a few hybrid metaheuristics based on ACO,
namely, hybridization with beam search (Beam-ACO) and with constraint programming (CP), and the
application of ACO in multilevel frameworks.
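
Returning to the candidate list strategy mentioned at the start of this section, a minimal Python sketch is given below. Since the transition probabilities of Equation 26.1 are proportional to the product of pheromone and heuristic terms, ranking by that product is equivalent to ranking by probability. The function name, the data layout, and the default exponents are assumptions made for illustration.

```python
def candidate_list(feasible, tau, eta, size, alpha=1.0, beta=2.0):
    """Keep only the `size` feasible components with the highest weight."""
    def weight(c):
        return (tau[c] ** alpha) * (eta[c] ** beta)
    return sorted(feasible, key=weight, reverse=True)[:size]

# Hypothetical components with equal pheromone and different heuristic values.
tau = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
eta = {"a": 0.5, "b": 2.0, "c": 1.0, "d": 4.0}
print(candidate_list(["a", "b", "c", "d"], tau, eta, size=2))
```

The probabilistic choice is then made among the returned components only, so the cost of each construction step depends on the size of the candidate list rather than on the number of feasible components.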


26.6.1 Hybridization with Beam Search 


Beam search (BS) is a classical tree search method for combinatorial optimization [59]. In particular,
BS is a deterministic approximate algorithm that relies heavily on bounding information for selecting
among partial solutions. In BS, a number of solutions (the beam) are iteratively constructed
interdependently and in parallel. Each construction step consists of two parts. In the first, a number of candidate
extensions of the partial solutions in the beam are selected based on heuristic information. In the second,
the candidate extensions form new partial solutions, and the BS algorithm selects a limited number
of these partial solutions by means of bounding information. The combination of ACO and BS, labeled
Beam-ACO, replaces the ant-based solution construction procedure of ACO with a probabilistic beam
search procedure [6,7]. The probabilistic beam search utilizes pheromone information for the selection
of candidate extensions of the beam. A further improvement, in cases where the bounding informa- 
tion is computationally expensive or unreliable, is replacing the bounding information with stochastic 
sampling of partial solutions [44]. Stochastic sampling in Beam-ACO executes the ant-based solution 
construction procedure of ACO to obtain a number of complete solutions (samples) from each partial 
solution in the beam. The best sample of each partial solution is considered an estimation of its quality. 


26.6.2 ACO and Constraint Programming 


Highly constrained problems, such as scheduling or timetabling, pose a particular challenge to meta- 
heuristics, since the difficulty lies not simply in finding a good solution among many feasible solutions, but 
in finding feasible solutions among many infeasible ones. Despite the fact that ACO algorithms generally 
obtain competitive results for many problems, the performance of classical ACO algorithms has not been 
entirely satisfactory in the case of overly constrained problems. These problems have been targeted by 
means of CP techniques [52]. Hence, the application of CP techniques for restricting the search performed 
by an ACO algorithm to promising regions of the search space [54] is not too far-fetched. 


26.6.3 Multilevel Frameworks Based on ACO 


Multilevel techniques [14,71] start from the original problem instance and generate smaller and smaller 
instances by successive coarsening until some stopping criteria are satisfied. This creates a hierarchy of 
problem instances in which the problem instance of a given level is always smaller than the problem 
instance of the next lower level. Then, a solution is generated for the smallest problem instance, and 
successively transformed into a solution of the next higher level until a solution for the original problem 
instance is obtained. In a multilevel framework based on ACO, an ACO algorithm is applied at each 
level of the hierarchy to improve the solutions obtained at lower levels [13,42,43]. 


26.7 Applications of ACO Algorithms 


Since the first application of AS to the travelling salesman problem in the early 1990s, the scope of ACO
algorithms has widened considerably. Researchers have applied ACO to classical optimization problems
such as assignment problems, scheduling problems, graph coloring, the maximum clique problem, and
vehicle routing problems. Recent real-world applications of ACO include cell placement problems
arising in circuit design, the design of communication networks, bioinformatics problems, and the optimal
design and operation of water distribution networks. Furthermore, there exists ongoing work on the
application of ACO to non-static problems. Finally, a recent research trend is the extension of ACO to
deal with problems with multiple objectives. Table 26.2 provides a list of representative applications of
ACO algorithms. Dorigo and Stützle [26] have compiled a comprehensive list of references.


TABLE 26.2 ACO Applications

Problem                                References
Traveling salesman problem             [22-25,70]
Quadratic assignment problem           [49,51,70]
Scheduling problems                    [6,12,19,28,53,69]
Vehicle routing problems               [31,61]
Timetabling                            [67]
Set packing                            [32]
Graph coloring                         [18]
Shortest supersequence problem         [55]
Sequential ordering                    [30]
Constraint satisfaction problems       [68]
Data mining                            [60]
Maximum clique problem                 [15]
Edge-disjoint paths problem            [5]
Cell placement in circuit design       [1]
Communication network design           [50]
Bioinformatics problems                [13,39,40,58,62,63]
Industrial problems                    [2,7,17,34,64]
Water distribution networks            [47,48]
Continuous optimization                [4,27,41,56,65,66]
Non-static problems                    [3,36]
Multi-objective problems               [21,37,45,46]
Music                                  [35]


26.8 Concluding Remarks 


Finally, we briefly address the question of when ACO should be used. First of all, ACO is, in
general, not superior to any other general-purpose optimizer. This follows from the results known as
no-free-lunch theorems [72]. However, for specific problems (see Table 26.2), ACO might, of course, work better
than other techniques. In general, ACO can be expected to work well for optimization problems for
which effective constructive heuristics are known. Moreover, ACO can only work if the search
space is such that good solutions are concentrated in certain areas of the search space. In contrast,
if good solutions are scattered all over the search space, there is nothing that can be learned from already
visited solutions. Unfortunately, it is currently impossible to make more specific claims about the
general suitability of ACO for different classes of problems. However, this limitation is not specific to ACO
but applies to all general-purpose optimizers.


27.1 Introduction


In the wood cutting industry, the central step of the manufacturing process is cutting large
two-dimensional boards of raw material to obtain the desired items. The operation of the cutting machines is
often driven by intelligent software systems with graphical user interfaces (GUIs), which help the
operator plan the cutting operations, and may include other features, such as establishing the sequence
in which the boards should be cut.

The characteristics of the set of items that are grouped to be cut from a set of large boards may vary
widely. There are companies that produce large quantities of items of a small number of different sizes,
as happens, for instance, in furniture companies that produce semifinished articles with standard sizes.
On the other hand, there are make-to-order furniture companies that produce finished goods and
typically solve different problems every day, with a large variety of item sizes in small quantities.

Intelligent software systems for this industry may include different heuristic algorithms tailored to 
provide good quality solutions to instances with a wide variety of characteristics, because optimal solu- 
tions may be beyond reach, due to the difficulty of the problem. In this chapter, we address heuristic 
procedures that are more suitable to make-to-order companies. 

In Section 27.2, the concept of bin packing related to those cutting operations and its characteristics,
such as cutting methods and packing strategies, are presented. In Section 27.3, we review several heuristics
for solving bin packing problems, such as the level-oriented (shelf-oriented) one-phase and two-phase
heuristics, and introduce new local search heuristics.

In Section 27.4, we summarize computational results for benchmark and real-world problems obtained
with a recently proposed heuristic, which combines a greedy heuristic (one-phase heuristic) with stochastic
neighborhood structures (local search heuristic). The results cover various problem scenarios,
i.e., various combinations of the following three parameters: (1) with or without item rotation,
(2) horizontal or vertical cut(s) applied in the first-stage cutting, and (3) two- or three-stage exact/non-exact
case. Finally, in Section 27.5, the main conclusions are drawn.


27.2 Bin-Packing Problems 


Bin-packing problems are well-known combinatorial optimization problems. They are closely related to
cutting stock problems, and the two problem types are conceptually equivalent (Wascher et al. 2007).
Thus, some terminology borrowed from cutting stock problems will be used in the following discussion.
The common objective of two-dimensional bin-packing problems is to pack a given set of rectangular
items into an unlimited number of identical rectangular bins such that the total number of used bins is
minimized, subject to three restrictions: (1) all items must be packed into bins, (2) no two items may
overlap, and (3) the edges of the items are parallel to those of the bins. These problems arise routinely
in the wood, glass, paper, steel, and cloth industries.

Since the packing location of each item depends on the cutting approach used, the cutting approaches
are presented first. Guillotine cutting and free cutting are the two possible types of cutting methods.
The former is frequently required because of the technological characteristics of automated cutting
machines. A guillotine cut runs from one edge of a rectangle to the opposite edge. Guillotine cuts
applied in n successive stages are referred to as n-stage guillotine cutting. Examples of two-stage and
three-stage guillotine cutting are portrayed in Figures 27.1 and 27.2, respectively. In these two figures,
the bold lines represent the horizontal (or vertical) cuts in the corresponding stages, and the gray
rectangles stand for waste material. For both two-stage and three-stage guillotine cutting, the process
starts with horizontal cut(s) in the first stage, followed by vertical cut(s) in the second stage. In
three-stage guillotine cutting, further horizontal cut(s) are applied in the third stage. Although the
process described above starts with horizontal cut(s), the first-stage cuts can alternatively be vertical.
In that case, horizontal cut(s) are performed in the second stage and vertical cut(s) in the third stage
(for three-stage cutting only). Furthermore, additional cut(s) can be applied after the final stage to
separate the waste from the items; this is known as trimming. These cuts are not counted as an
additional stage. Free cutting means that the cutting is not subject to any such restriction, i.e., it is
non-guillotine. An instance of free cutting is depicted in Figure 27.3. This chapter focuses on guillotine
cutting.

FIGURE 27.1 An example of two-stage guillotine cutting.

FIGURE 27.2 An example of three-stage guillotine cutting.

FIGURE 27.3 An example of free cutting.

To fit the guillotine-cutting approach easily, the most common way of packing items into bins is the
level-oriented (shelf-oriented) method. The idea of this method is that shelves are created by packing
items from left to right. The height of each shelf is determined by the tallest item, which resides in the
leftmost position of the shelf. Whether an item is packed on top of or to the right of another packed
item, or in a new shelf (i.e., the packing location of the item), depends on the following factors: the size
of the free space available, the number of cutting stages required, and whether trimming is adopted. In
this chapter, the numbers of stages considered are two and three. Accordingly, four different
level-oriented scenarios can be formed: (1) two-stage without trimming, (2) two-stage with trimming,
(3) three-stage without trimming, and (4) three-stage with trimming. Figure 27.4a through d illustrates
examples of these four scenarios, respectively. Assume that there are 13 and 23 items packed in a bin
for the two-stage and three-stage problems, respectively, and that the items are packed in ascending
order of the item index. Scenarios (a) and (b) are two-stage problems, so no item can be packed on top
of any packed item in a shelf. In scenario (a), trimming is not allowed after the final stage, and therefore
all items in the same shelf must have identical heights. Trimming is permitted in scenario (b), and thus
free space can exist above any packed item except the leftmost item in each shelf. Scenarios (c) and (d)
are three-stage problems, and therefore an item can be packed on top of any packed item except the
leftmost item in each shelf. As trimming is not allowed in scenario (c), all items in the same stack must
have identical widths. This restriction does not apply in scenario (d) because trimming is permitted.
Note that the leftmost item in each shelf is also a stack, which determines the height of the shelf.

FIGURE 27.4 Examples of four different level-oriented scenarios.
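
The placement rules of the four scenarios can be condensed into two feasibility checks. The sketch below is only an illustration of the rules as stated in the text; the function names, the parameter names, and the decomposition into "floor" and "stack" checks are assumptions made here and are not taken from the cited works.

```python
def fits_on_shelf_floor(item_w, item_h, free_w, shelf_h, stages, trimming):
    """Can the item start a new stack to the right of the items already in the shelf?"""
    if item_w > free_w or item_h > shelf_h:
        return False
    if stages == 2 and not trimming:
        # Scenario (a): every item in the shelf must have the shelf height.
        return item_h == shelf_h
    # Scenarios (b), (c), (d): a shorter item is allowed next to the leftmost one.
    return True


def fits_on_stack(item_w, item_h, stack_w, free_h, stages, trimming, leftmost):
    """Can the item be placed on top of an existing stack?"""
    if stages == 2 or leftmost:
        # Two-stage problems forbid stacking; the leftmost stack is the single
        # item that fixes the shelf height.
        return False
    if item_h > free_h:
        return False
    # Scenario (c) requires identical widths; scenario (d) only requires
    # that the item is not wider than the stack.
    return item_w <= stack_w if trimming else item_w == stack_w
```

A packing heuristic would call these checks for each candidate location before applying its own selection rule.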

There are two possible ways to deal with items when packing them into bins: (1) rotation of items is 
not permitted, i.e., fixed orientation, and (2) rotation of items is permitted by 90°. Items in the form of 
some raw materials cannot be rotated. To cite an example, some wood and cloth have decorative pat- 
terns, which are required to be fixed in a particular direction when pieces are assembled. Nevertheless, 
the items may be rotated provided that they are plain materials. 

Based on different combinations of feasibility of item rotation and cutting methods, Lodi et al. (1999)
proposed a classification of two-dimensional bin-packing problems (2BP) into four cases:
(1) 2BP|O|G, (2) 2BP|R|G, (3) 2BP|O|F, and (4) 2BP|R|F, where O denotes that items are oriented, i.e.,
they cannot be rotated; R denotes that items may be rotated by 90°; G signifies that guillotine cutting is
used; and F signifies that free cutting is adopted. This kind of classification helps unify definitions
and notation, facilitates communication between researchers in the field, and offers faster access to the
relevant literature (Wascher et al. 2007).

Cutting and packing problems have been extensively studied in the last decade. Websites such as the
EURO Special Interest Group on Cutting and Packing (ESICUP), PackLib², and OR-Library were
established to help researchers collect previously proposed benchmark problems for testing the
efficiency and effectiveness of their algorithms.


© 2011 by Taylor and Francis Group, LLC 




The possible optimization methods for solving bin-packing problems are exact methods, heuristics, and
meta-heuristics (Lodi et al. 1999; Carter and Price 2001; Sait and Youssef 1999). Although exact
methods are guaranteed to find an optimal solution, the difficulty of obtaining one increases
drastically with the problem size, because the problem is NP-hard (Garey and Johnson 1979). For large
instances, heuristic or meta-heuristic approaches are instead able to find a good-quality solution in a
reasonable amount of time. In particular, this chapter focuses on the implementation of different
level-oriented heuristics and local search-based heuristics, both earlier and recent, for solving
bin-packing problems.


27.3 Heuristics 


This section discusses various heuristics proposed for tackling bin-packing problems. There are two
scenarios for bin-packing problems: (1) In the offline scenario, the information on all items is known
before the problem is solved. (2) In the online scenario, the information on the next item is unknown
while the current item is being packed. In this chapter, we focus on heuristics for the offline scenario
only. These heuristics can be classified into two types (Lodi et al. 2002): (1) one-phase heuristics and
(2) two-phase heuristics. One-phase heuristics pack items into bins directly, whereas two-phase
heuristics aggregate items to form strips (shelves) in the first phase and then pack the strips into bins
in the second phase. In the following, one-phase heuristics, two-phase heuristics, and local search
heuristics are described in turn.


27.3.1 One-Phase Heuristics 
27.3.1.1 Finite First Fit 


This algorithm was proposed by Berkey and Wang (1987). First, the items are sorted by nonincreasing
height. The levels of the first used bin are considered from the lowest to the highest, and the current
item is packed into the first level in which it fits. If no level can fit it but the residual height of the bin
is sufficient, it is packed into a new level created in the same bin. Otherwise, the same steps are applied
to the subsequently used bins. If no bin can accommodate it, it is loaded into a new level of a new bin.

Figure 27.5a depicts an instance of this method. Sorted items 1 and 2 are first loaded into the first level
of the first bin. As the empty space to the right of item 2 is not large enough to accommodate item 3, it
is packed into the newly created second level. Then, items 4 and 5 are packed into the “first-fit” free
spaces of the first and second levels, respectively. Item 6 must be loaded into the newly created third
level because neither the first nor the second level has enough empty space for it. Similarly, item 7 is
loaded into a new bin since no level of the first bin can fit it.
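
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: items are assumed to be (width, height) tuples, rotation is not considered, and the data layout (a bin as a list of level dictionaries) is a choice made here for readability.

```python
def finite_first_fit(items, bin_w, bin_h):
    """Pack (width, height) items; returns a list of bins, each a list of levels."""
    bins = []  # level: {"h": level height, "used_w": width in use, "items": [...]}
    for w, h in sorted(items, key=lambda it: it[1], reverse=True):
        if w > bin_w or h > bin_h:
            raise ValueError("item does not fit in an empty bin")
        placed = False
        for levels in bins:
            # 1) First existing level, lowest first, with enough residual width.
            #    The height always fits because items arrive by nonincreasing height.
            for lvl in levels:
                if lvl["used_w"] + w <= bin_w:
                    lvl["used_w"] += w
                    lvl["items"].append((w, h))
                    placed = True
                    break
            # 2) Otherwise open a new level in this bin if the residual height allows.
            if not placed and sum(l["h"] for l in levels) + h <= bin_h:
                levels.append({"h": h, "used_w": w, "items": [(w, h)]})
                placed = True
            if placed:
                break
        # 3) No used bin can take the item: open a new bin.
        if not placed:
            bins.append([{"h": h, "used_w": w, "items": [(w, h)]}])
    return bins
```

With the hypothetical items (5,4), (3,4), (6,3), (2,3), (3,2), (5,2), (8,2) and a 10 × 10 bin, the sketch reproduces the pattern of the example: three levels in the first bin and the last item alone in a second bin.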


27.3.1.2 Finite Next Fit 


This heuristic was proposed by Berkey and Wang (1987). First, the items are sorted by nonincreasing 
height. The current item is packed into the current level of the current bin if it fits. Otherwise, it is packed 
into a new level created in the current bin if the residual height is sufficient. Otherwise, it is loaded into a 
new level of a new bin. 

Figure 27.5b shows an example of this approach. After sorting the items, items 1 and 2 are first loaded 
into the first level of the first bin. Since the empty space at the right of item 2 is not large enough to fit 
item 3, it is packed into the newly created second level. After packing item 4, it is required to load item 5 
into the newly created third level because the empty space at the right of item 4 is not sufficiently large. 
After packing item 6, item 7 is loaded into the first level of a new bin since the third level of the first bin 
cannot accommodate it. 
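
Finite next fit differs from finite first fit only in that earlier levels and bins are never revisited. The following sketch makes this explicit under the same assumptions as before: (width, height) tuples, no rotation, and a data layout chosen here for illustration.

```python
def finite_next_fit(items, bin_w, bin_h):
    """Pack (width, height) items, looking only at the current level and bin."""
    bins = []    # each bin is a list of levels: {"h", "used_w", "items"}
    used_h = 0   # total height of the levels in the current (last) bin
    for w, h in sorted(items, key=lambda it: it[1], reverse=True):
        if w > bin_w or h > bin_h:
            raise ValueError("item does not fit in an empty bin")
        if bins and bins[-1][-1]["used_w"] + w <= bin_w:
            # The item fits in the current level of the current bin.
            bins[-1][-1]["used_w"] += w
            bins[-1][-1]["items"].append((w, h))
        elif bins and used_h + h <= bin_h:
            # New level in the current bin.
            bins[-1].append({"h": h, "used_w": w, "items": [(w, h)]})
            used_h += h
        else:
            # New level in a new bin.
            bins.append([{"h": h, "used_w": w, "items": [(w, h)]}])
            used_h = h
    return bins
```

Because closed levels are never reopened, this variant does less work per item than finite first fit but can leave more unused space.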
FIGURE 27.5 Examples of finite first fit and finite next fit.


27.3.1.3 Greedy Heuristic 


A greedy heuristic was proposed by Alvelos et al. (2009). The two elements involved in the heuristic are 
described as follows: 


• A stack is a set of items, each of which, except the bottom one, is placed on top of another.
For two-stage problems, since no further horizontal cuts are permitted after the second-stage
guillotine cutting, except for trimming, each stack must contain only one item. This constraint
does not apply to three-stage problems.

• A shelf, which is a row of a bin, contains at least one stack or item.


The heuristic is based on sorting the item types to define an initial packing sequence and on iterative
attempts to pack items into existing stacks, shelves, or bins according to selection criteria. For the
sorting, three criteria are considered for arranging the item types in descending order: (1) by width,
(2) by height, and (3) by area. For the packing, more than one existing stack, shelf, or bin may be able
to accommodate an item. Therefore, criteria must be established to determine which existing stack,
shelf, or bin is selected for packing the item. The three criteria used for this purpose are (1) minimize
the residual width after packing the item, (2) minimize the residual height after packing the item, and
(3) minimize the residual area after packing the item. Note that different criteria can be adopted for
stack, shelf, and bin selection.

The first step of the heuristic is to define a packing sequence for the items by sorting the item types
according to a specified criterion. Then, iteratively, each item, which may or may not be rotated
depending on the problem, is packed into the existing stack that minimizes a specified criterion. If this
is not possible, we try to pack it into the existing shelf that minimizes a specified criterion. If this is
not possible, we try to pack it into the used bin that minimizes a specified criterion. If this is not
possible either, it is placed into a new bin. Finally, a solution is obtained by running the heuristic with
several sets of criteria and then selecting the best of the obtained solutions.

FIGURE 27.6 An example of constructive heuristic.

Figure 27.6 illustrates an example of the greedy heuristic. Assume that the items are sorted by height 
and four items have already been loaded into the bin. The item 5 currently considered can be loaded into 
an existing stack (free space A) (for three-stage problems only), an existing shelf (free space C or D), on 
the top of shelves of the bin (free space E), or a new empty bin. However, the free space B cannot accom- 
modate it. The place where the item is packed depends on the preselected criterion and the type of the 
problem tackled. 

Suppose that a two-stage problem with trimming is being solved. In this case, only three empty spaces,
C, D, and E, are available for packing item 5. If criterion (1) is set, the item is packed into C; if
criterion (2) is selected, it is packed into D; if criterion (3) is chosen, it is also placed into D. Note that
the same applies to a three-stage problem without trimming. However, if it is a three-stage problem
with trimming and criterion (1) or (3) is set, the item is placed into A instead of C or D.
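
The selection step can be illustrated with a small sketch. The dimensions of item 5 and of the free spaces A to E below are hypothetical values chosen only so that the outcome agrees with the example; the figure does not give them. The function name `select_space` and the criterion labels are likewise assumptions made here.

```python
# Residual measures for the three selection criteria; free and item are (width, height).
CRITERIA = {
    "width": lambda free, item: free[0] - item[0],
    "height": lambda free, item: free[1] - item[1],
    "area": lambda free, item: free[0] * free[1] - item[0] * item[1],
}


def select_space(item, spaces, criterion):
    """Name of the free space that fits the item with the smallest residual, or None."""
    residual = CRITERIA[criterion]
    feasible = {name: wh for name, wh in spaces.items()
                if item[0] <= wh[0] and item[1] <= wh[1]}
    if not feasible:
        return None
    return min(feasible, key=lambda name: residual(feasible[name], item))


# Hypothetical dimensions (not taken from Figure 27.6).
item5 = (3, 2)
spaces = {"A": (3, 4), "B": (2, 5), "C": (4, 6), "D": (5, 3), "E": (10, 4)}
two_stage_spaces = {name: spaces[name] for name in "CDE"}  # stacking is not allowed
```

With these values, the width criterion selects C and the height and area criteria select D when only C, D, and E are admissible. When A is also admissible, the width and area criteria select A.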


27.3.2 Two-Phase Heuristics 


27.3.2.1 Hybrid First Fit 


The hybrid first fit algorithm was proposed by Chung et al. (1982). In the first phase, a strip packing is
performed by using the first-fit decreasing height strategy, which works as follows. An item is packed
left-justified in the “first-fit” level. If no level can accommodate the item, a new level is created and
the item is packed left-justified in this level. The problem then becomes a one-dimensional bin-packing
problem. In the second phase, this problem is solved by means of the first-fit decreasing algorithm,
which proceeds as follows. The first bin is created for packing the first strip. Each subsequent strip is
packed into the “first-fit” bin. If no bin can accommodate the strip, a new bin is initialized.

Figure 27.7a illustrates an example. First, the strip packing is conducted in a bin with the same width 
as an actual bin and infinite height. Its packing procedures are the same as the example given in the 
finite first fit described in Section 27.3.1.1. Then, after packing the first strip into the first bin, the remain- 
ing empty space is not large enough to fit the second strip. Thus, it is packed into the second bin. The 
third strip is packed into the first bin because it can first fit the remaining empty space. The same applies 
to the last strip in the second bin. 
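
A compact sketch of the two phases follows. It assumes (width, height) tuples and no rotation, and the data layout is an illustrative choice made here. Since the items are processed by nonincreasing height, the strips are created in nonincreasing height order, so the second phase does not need to sort them again.

```python
def hybrid_first_fit(items, bin_w, bin_h):
    """Two-phase packing: first-fit decreasing height strips, then first-fit bins."""
    # Phase 1: strip packing on a strip of width bin_w and unbounded height.
    strips = []  # strip: {"h": strip height, "used_w": width in use, "items": [...]}
    for w, h in sorted(items, key=lambda it: it[1], reverse=True):
        if w > bin_w or h > bin_h:
            raise ValueError("item does not fit in an empty bin")
        for s in strips:
            if s["used_w"] + w <= bin_w:
                s["used_w"] += w
                s["items"].append((w, h))
                break
        else:  # no existing strip can take the item
            strips.append({"h": h, "used_w": w, "items": [(w, h)]})

    # Phase 2: one-dimensional first-fit decreasing on the strip heights.
    bins = []  # bin: {"used_h": height in use, "strips": [...]}
    for s in strips:
        for b in bins:
            if b["used_h"] + s["h"] <= bin_h:
                b["used_h"] += s["h"]
                b["strips"].append(s)
                break
        else:  # no used bin can take the strip
            bins.append({"used_h": s["h"], "strips": [s]})
    return bins
```

For the hypothetical items (5,4), (3,4), (6,3), (2,3), (3,2), (5,2), (8,2) and a 10 × 6 bin, four strips of heights 4, 3, 2, and 2 are formed; the first and third go into the first bin and the second and fourth into the second bin, as in the narrative above.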


27.3.2.2 Hybrid Next Fit 


This algorithm was suggested by Frenk and Galambos (1987). The principle of this approach is similar
to that of hybrid first fit. In the first phase, the next-fit decreasing height strategy is adopted to carry
out a strip packing. This strategy is described as follows. If the current level can accommodate an item,
the item is packed left-justified in this level. Otherwise, a new level is created and the item is packed
left-justified in the new level. Then, in the second phase, the one-dimensional bin-packing problem is
solved by using the next-fit decreasing algorithm, which is described as follows. The current strip is
packed into the current bin if it fits. If the current bin cannot accommodate the strip, a new bin is
initialized.

FIGURE 27.7 Examples of hybrid first fit, hybrid next fit, and hybrid best fit (finite best strip).

Figure 27.7b shows an example. First, the strip packing is performed in a bin with the same width as 
an actual bin and infinite height. Its packing procedures are the same as the example given in the finite 
next fit described in Section 27.3.1.2. Then, after loading the first strip into the first bin, the remaining 
empty space is not large enough to fit the second strip. Thus, it is packed into the second bin. Since the 
remaining empty space of the second bin can accommodate both the third and fourth strips, they are 
placed into that bin. The last strip is loaded into the third bin as the remaining empty space of the second 
bin cannot accommodate it. 


27.3.2.3 Hybrid Best Fit (Finite Best Strip) 


The hybrid best fit approach was suggested by Berkey and Wang (1987). The implementation of this
approach is similar to that of hybrid first fit. The best-fit decreasing height strategy is adopted to
conduct a strip packing in the first phase. The idea of this strategy is that an item is packed
left-justified in the level satisfying two criteria: (1) it fits the item and (2) the residual width is
minimized. If no level can accommodate the item, a new level is created and the item is loaded
left-justified in this level. In the second phase, the one-dimensional bin-packing problem is solved by
the best-fit decreasing algorithm, which proceeds as follows. The current strip is packed into the bin
fulfilling two criteria: (1) it fits the strip and (2) the residual height is minimized. If no bin can
accommodate the strip, a new bin is created.

Figure 27.7c portrays an instance. First, the strip packing is implemented in a bin with the same width
as an actual bin and infinite height. After packing the sorted items 1 and 2, item 3 does not fit the empty
space to the right of item 2. Therefore, it is packed into the second strip. Item 4 is then loaded into the
second strip rather than the first strip because the residual width is smaller in the second strip. The
same reasoning applies to item 5, which is loaded into the first strip. Finally, items 6, 7, and 8 are
placed into the third and fourth strips, respectively. The strips must now be packed into bins. After
packing the first strip into the first bin, the remaining empty space is not large enough to fit the second
strip and, therefore, it is packed into the second bin. The third strip is loaded into the first bin because
this minimizes the residual height. The last strip does not fit the remaining empty space of the first bin
and thus it is loaded into the second bin.
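
The best-fit rule of both phases can be written as one helper that returns the container with the smallest nonnegative residual. The sketch below is an illustration under the same assumptions as the earlier ones: (width, height) tuples, no rotation, and ties broken by the lowest index, which is a choice made here and is not specified in the text.

```python
def best_index(used, size, capacity):
    """Index of the entry where size fits with the smallest residual, else None."""
    best = None
    for i, u in enumerate(used):
        residual = capacity - u - size
        if residual >= 0 and (best is None or residual < capacity - used[best] - size):
            best = i
    return best


def hybrid_best_fit(items, bin_w, bin_h):
    """Two-phase packing with best-fit selection of strips and then of bins."""
    strips = []  # strip: {"h": strip height, "used_w": width in use, "items": [...]}
    for w, h in sorted(items, key=lambda it: it[1], reverse=True):
        if w > bin_w or h > bin_h:
            raise ValueError("item does not fit in an empty bin")
        i = best_index([s["used_w"] for s in strips], w, bin_w)
        if i is None:
            strips.append({"h": h, "used_w": w, "items": [(w, h)]})
        else:
            strips[i]["used_w"] += w
            strips[i]["items"].append((w, h))

    bins = []  # bin: {"used_h": height in use, "strips": [...]}
    for s in strips:
        i = best_index([b["used_h"] for b in bins], s["h"], bin_h)
        if i is None:
            bins.append({"used_h": s["h"], "strips": [s]})
        else:
            bins[i]["used_h"] += s["h"]
            bins[i]["strips"].append(s)
    return bins
```

For the hypothetical items (3,4), (2,4), (6,3), (4,3), (5,2), (7,2), (6,2) and a 10 × 6 bin, the fourth item joins the second strip (residual width 0) instead of the first (residual width 1), and the third strip joins the first bin because it leaves no residual height there.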


27.3.2.4 Floor Ceiling 


This approach was proposed by Lodi et al. (1999). In this algorithm, the ceiling of a level is defined as
the horizontal line touching the upper edge of the tallest item packed in the level. Floor ceiling packs
the items not only from left to right on the floor of a level but also from right to left on the ceiling of
the level. However, the first item can be packed on the ceiling only if it cannot be packed on the floor.

In the first phase, the levels are created by packing items into them in the following order of
preference: (1) on a ceiling by using a best-fit algorithm, if the aforementioned condition is satisfied,
(2) on a floor by means of a best-fit algorithm, and (3) on the floor of a new level. In the second phase,
the levels are packed into bins by using either the best-fit decreasing algorithm or an exact algorithm
for the one-dimensional bin-packing problem, halted after a predefined number of iterations.

Floor ceiling was initially designed for non-guillotine bin packing. However, it can be amended to
support guillotine bin packing. Because of its intrinsic characteristics, the modified variant can be
used to solve three-stage problems only. Figure 27.8a and b illustrates the difference between packing
without and with the guillotine constraint, respectively, in the first phase of floor ceiling. In the former
case, shown in Figure 27.8a, after the first five sorted items have been packed on the floor of the first
level, the empty space to the right of item 5 is not large enough to fit item 6 and, thus, it is the first one
packed on the ceiling of the same level. The same applies to item 7. Since item 8 fits neither on the floor
nor on the ceiling of the first level, it is loaded onto the floor of the second level. Similarly, items 9, 11,
and 10 are placed into the first and second level, respectively.

FIGURE 27.8 Difference between packing (a) without and (b) with guillotine constraint in the first phase of floor ceiling. Each panel shows a first level and a second level.

In the latter case, shown in Figure 27.8b, the first five sorted items are placed in the same way as in
Figure 27.8a. The empty space in the first level must now be delimited into four smaller areas bounded
by dotted lines, in which subsequent items can be placed while meeting the guillotine constraint; the
same applies to each of the other levels. Since the empty space to the right of item 5 is not sufficiently
large to accommodate item 6, the ceiling is searched from right to left for an empty space bounded by
dotted lines whose width is equal to or larger than that of item 6. The empty space above item 4
satisfies this condition, and thus item 6 is loaded there with one of its vertical sides touching a vertical
dotted line. In this example, the right side is used. The same rule applies to item 7. As in Figure 27.8a,
since item 8 fits neither on the floor nor on the ceiling of the first level, it is packed on the floor of the
second level. Items 9, 10, and 11 are loaded by using the same rules described above. Note that the
scenario being tackled is a three-stage problem with trimming. For the non-trimming case, items 11, 9,
6, and 7 placed in those empty spaces must have the same widths as items 2, 3, 4, and 5, respectively.


27.3.3 Local Search Heuristics 


A local search algorithm aims to find a good solution efficiently by applying a sequence of small
perturbations to an initial solution, and it deals with one solution (the current solution) at a time.
Before it is implemented, the neighborhood, i.e., the operation that produces a different solution from
the current solution, has to be defined. The procedure of a local search algorithm is as follows. First, an
initial solution is generated and the current solution is set to the initial solution. The value of the
current solution is calculated. The iteration loop starts by obtaining a neighbor of the current solution
through the defined operation and computing its objective value. In the first improvement scheme, if
the neighbor solution outperforms the current solution, the current solution is replaced by the
neighbor. Otherwise, the current solution remains unchanged. In the best improvement (steepest
descent) scheme, all the neighbor solutions are evaluated and the best one is selected for comparison
with the current solution.

The iteration loop repeats until no neighbor solution improves the current solution. Two local search
heuristics, named variable neighborhood descent (VND) and stochastic neighborhood structures, are
built on the concept of the local search algorithm and are introduced below.
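
The two schemes can be sketched generically as follows. The sketch assumes a minimization problem; `neighbors` and `cost` are caller-supplied functions, and the toy neighborhood (adjacent swaps) and toy objective (number of inversions) are stand-ins chosen here for illustration and are not the packing objective of this chapter.

```python
def local_search(initial, neighbors, cost, best_improvement=False):
    """Descend from initial until no neighbor has a lower cost."""
    current, current_cost = initial, cost(initial)
    while True:
        if best_improvement:
            # Steepest descent: evaluate every neighbor and keep only the best one.
            candidates = sorted(neighbors(current), key=cost)[:1]
        else:
            # First improvement: scan the neighbors in generation order.
            candidates = neighbors(current)
        step = next((n for n in candidates if cost(n) < current_cost), None)
        if step is None:
            return current, current_cost  # local optimum reached
        current, current_cost = step, cost(step)


def adjacent_swaps(seq):
    """Toy neighborhood: all sequences obtained by swapping two adjacent elements."""
    for i in range(len(seq) - 1):
        yield seq[:i] + [seq[i + 1], seq[i]] + seq[i + 2:]


def inversions(seq):
    """Toy objective: number of out-of-order pairs."""
    return sum(a > b for i, a in enumerate(seq) for b in seq[i + 1:])
```

For example, `local_search([3, 1, 2], adjacent_swaps, inversions)` ends at the sorted sequence under either scheme.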


27.3.3.1 Variable Neighborhood Descent 


VND is a meta-heuristic proposed by Mladenovic and Hansen (Hansen and Mladenovic 1999, 2001;
Mladenovic and Hansen 1997). The concept of VND is to utilize different neighborhood structures
systematically. A brief description of VND is as follows. First, a set of neighborhood structures to be
adopted in the descent is chosen and an initial solution is found. An iteration loop then starts. The first
neighborhood structure is used to search for the best neighbor of the current solution. If the best
neighbor solution outperforms the current solution, the latter is replaced by the former. Otherwise, the
next (second, third, and so on) neighborhood structure is considered. The iteration loop repeats until
all neighborhood structures have been utilized.

This chapter presents three neighborhood structures, devised in a particular order, for the sequential
VND proposed by Alvelos et al. (2009), which differs from the common VND described above. In their
study, the three neighborhood structures, arranged in a fixed order, are applied one after another in a
loop until the time limit is reached or the current solution cannot be improved.
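
The sequential scheme can be sketched as below. This is one reading of the description above, not the authors' code: each neighborhood structure is assumed to be a function that returns all neighbors of a solution, the best neighbor is accepted when it improves the current solution, and the names and the default time limit are hypothetical.

```python
import time


def sequential_vnd(initial, structures, cost, time_limit=1.0):
    """Apply the neighborhood structures in fixed order until no pass improves."""
    current, current_cost = initial, cost(initial)
    deadline = time.monotonic() + time_limit
    improved = True
    while improved and time.monotonic() < deadline:
        improved = False
        for neighbors in structures:  # fixed order, one structure after another
            best = min(neighbors(current), key=cost, default=None)
            if best is not None and cost(best) < current_cost:
                current, current_cost = best, cost(best)
                improved = True
    return current, current_cost
```

The loop stops as soon as a complete pass over all structures yields no improvement, or when the time limit expires.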

It is vital to understand the solution representation and the evaluation function before devising
neighborhood structures. A solution is represented by a sequence of items satisfying the demand of
each item type. Consider the following example. Assume that item types 1, 2, and 3 have demands 3,
5, and 2, respectively. A feasible solution is represented by 2, 2, 2, 2, 2, 1, 1, 1, 3, 3. The five items of
type 2 will be packed first. Next, the three items of type 1 will be loaded, and finally the two items of
type 3. Note that the form of the solution representation is the same before and after a neighborhood
structure is applied. Moreover, in order to examine the quality of a solution, an evaluation function f is
required, which is given by


$f = n_b \cdot a_b + o_b - m$    (27.1)


where
$n_b$ is the number of used bins
$a_b$ is the area of a bin
$o_b$ is the occupied area of the bin with the smallest occupied area in the solution
$m$ is the number of items packed in that same bin


The rationale behind this function is that a solution with fewer used bins always outperforms other 
solutions with more used bins and that if two solutions have the same number of used bins, the solution 
where it is easier to empty one used bin is better than the other one. In the following, these neighbor- 
hood structures are introduced in order. 
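
A direct transcription of the evaluation function is sketched below. It follows one reading of Equation 27.1: the number of used bins times the bin area, plus the occupied area of the least-filled bin, minus the number of items in that bin. The representation of a packed solution as a list of bins, each a list of (width, height) items, is an assumption made here.

```python
def evaluate(bins, bin_area):
    """Value of a packed solution; lower is better."""
    def occupied(items):
        return sum(w * h for w, h in items)

    least_filled = min(bins, key=occupied)  # the bin that is easiest to empty
    return len(bins) * bin_area + occupied(least_filled) - len(least_filled)
```

With a bin area of 100, a two-bin solution whose least-filled bin holds a single 2 × 2 item evaluates to 203, while a three-bin solution with the same least-filled bin evaluates to 303. If the least-filled bin instead holds two 1 × 2 items (the same occupied area), the value drops to 202.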


27.3.3.1.1 First Neighborhood Structure: Swap Adjacent Item Types 


“Swap adjacent item types” is aimed at swapping all items in two adjacent item types. Figure 27.9 illus- 
trates an instance of this neighborhood structure. Assume that two highlighted adjacent item types, 
2 and 4, are chosen. One item in the first type and three items in the second type are exchanged to 
produce the new solution. 
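
On the sequence representation, this move exchanges two neighboring runs of identical item types. The sketch below is illustrative; the function name and the 0-based run index are choices made here, and the sequence in the usage note is an assumed reading of the figure.

```python
from itertools import groupby


def swap_adjacent_types(seq, k):
    """Swap the k-th and (k+1)-th runs of identical item types (0-based run index)."""
    runs = [list(group) for _, group in groupby(seq)]
    if not 0 <= k < len(runs) - 1:
        raise ValueError("run index out of range")
    runs[k], runs[k + 1] = runs[k + 1], runs[k]
    return [item for run in runs for item in run]
```

For the sequence 1, 2, 3, 1, 3, 3, 2, 4, 4, 4, 2, swapping the run of one item of type 2 (run index 5) with the following run of three items of type 4 gives 1, 2, 3, 1, 3, 3, 4, 4, 4, 2, 2.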


FIGURE 27.9 An example of “swap adjacent item types.”


27.3.3.1.2 Second Neighborhood Structure: Swap Adjacent Item Subsequences


“Swap adjacent item subsequences” swaps two adjacent item subsequences of the same size. A size
parameter is used to define the size of the neighborhood. Figure 27.10a and b portrays instances of this
neighborhood structure for two different sizes of the neighborhood. In the former, where the size of the
neighborhood is one, suppose that two highlighted adjacent item subsequences, 3 and 2, are chosen in
the whole item packing sequence (solution). These two item subsequences are then exchanged to
complete the operation. In the latter, where the size of the neighborhood is two, two shaded adjacent
item subsequences, 1, 3 and 3, 2, are selected in the current solution and then swapped to produce the
new solution.

FIGURE 27.10 Examples of “swap adjacent item subsequences.”
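
A minimal sketch of this move on a Python list is given below. The function name and the 0-based start index are choices made here, and the sequence used in the usage note is an assumed reading of the figure.

```python
def swap_adjacent_subsequences(seq, start, size):
    """Swap seq[start:start+size] with the block of the same size that follows it."""
    if start < 0 or size < 1 or start + 2 * size > len(seq):
        raise ValueError("blocks do not fit in the sequence")
    first = seq[start:start + size]
    second = seq[start + size:start + 2 * size]
    return seq[:start] + second + first + seq[start + 2 * size:]
```

For the sequence 1, 2, 3, 1, 3, 3, 2, 4, 4, 4, 2, size one at index 5 swaps the subsequences 3 and 2, and size two at index 3 swaps the subsequences 1, 3 and 3, 2.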


27.3.3.1.3 Third Neighborhood Structure: Reverse Item Subsequences


“Reverse item subsequences” reverses the order of an item subsequence with a given size. A size
parameter is utilized to define the size of the item subsequence. Figure 27.11a and b shows examples of
this neighborhood structure for two different sizes of the item subsequence. The item subsequences
3, 2 and 3, 2, 4 are selected, respectively, and their packing orders are reversed to generate the new
solutions.
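
The reversal move is a one-line slice operation. As before, the function name, the 0-based start index, and the sequence in the usage note are assumptions made for illustration.

```python
def reverse_subsequence(seq, start, size):
    """Reverse the order of the block seq[start:start+size]."""
    if start < 0 or size < 1 or start + size > len(seq):
        raise ValueError("block does not fit in the sequence")
    return seq[:start] + seq[start:start + size][::-1] + seq[start + size:]
```

For the sequence 1, 2, 3, 1, 3, 3, 2, 4, 4, 4, 2, reversing the block 3, 2 at index 5 gives 1, 2, 3, 1, 3, 2, 3, 4, 4, 4, 2, and reversing the block 3, 2, 4 at the same index gives 1, 2, 3, 1, 3, 4, 2, 3, 4, 4, 2.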


FIGURE 27.11 Examples of “reverse item subsequences.”


27.3.3.2 Stochastic Neighborhood Structures


Stochastic neighborhood structures (SNS), proposed by Chan et al. (2009), are adopted in a way similar
to VND. Since a local optimum with respect to one neighborhood structure is not necessarily a local
optimum with respect to another, the use of several different neighborhood structures, which is the
basic concept of both SNS and VND, can further improve the current local optimum. SNS differs from
VND in that SNS (1) uses only stochastic neighborhood structures rather than deterministic or mixed
ones, (2) uses a fixed number of iterations to explore better neighborhood solutions for each or all
neighborhood structures, and (3) does not use the iteration loop over the neighborhood structures.

In their study, SNS is employed to improve the quality of the solution given by the greedy heuristic
in the first phase. In each neighborhood structure of the proposed approach, instead of finding the best
neighbor of the initial solution by complete enumeration, only one neighbor is randomly generated at a
time and then compared with the initial solution. The advantage of this modification is that it avoids
the large computational load of searching all neighbors of the initial solution, especially when the
problem size is huge. Three neighborhood structures are proposed, to be implemented in a fixed order.
In the following, these neighborhood structures are introduced in order.


27.3.3.2.1 First Neighborhood Structure: Cut-and-Paste 


“Cut-and-paste” is a genetic operation that was applied in the jumping-gene paradigm to solve multi-objective optimization problems (Chan et al. 2008). In this operation, the “jumping” element is cut
from its original position and pasted into a new position of a chromosome. In this study, “cut-and-paste”
is applied to the solution in such a way that there is only one “jumping” segment in the solution, and the length,
original position, and new position of the “jumping” segment are randomly chosen. Figure 27.12 depicts
an example of this neighborhood structure. Suppose that the randomly generated length is 4, the original
position is 6, and the new position is 2. In other words, the highlighted segment (4, 2, 1, 4) is randomly
selected. The segment is cut from the original position and then pasted into the new position to complete
the operation.
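
A minimal sketch of this move is given below. Two points are assumptions rather than details from the study: positions are treated as 1-based, and the "new position" is read as the position at which the segment starts in the resulting sequence. The sample sequence is hypothetical; only its segment (4, 2, 1, 4) at position 6 follows the text.

```python
import random


def cut_and_paste(seq, length, src, dst):
    """Cut `length` items starting at 1-based position `src` and paste them
    so that the segment starts at 1-based position `dst` of the result."""
    s, d = src - 1, dst - 1
    segment = seq[s:s + length]
    rest = seq[:s] + seq[s + length:]        # sequence with the segment removed
    return rest[:d] + segment + rest[d:]


def random_cut_and_paste(seq, rng=random):
    """Draw the length, original position, and new position at random."""
    n = len(seq)
    length = rng.randint(1, n - 1)
    src = rng.randint(1, n - length + 1)
    dst = rng.randint(1, n - length + 1)
    return cut_and_paste(seq, length, src, dst)


# Hypothetical item sequence whose positions 6..9 hold the segment (4, 2, 1, 4).
example = [1, 2, 3, 1, 3, 4, 2, 1, 4, 4]
moved = cut_and_paste(example, length=4, src=6, dst=2)
```

The move only reorders items, so the multiset of items in the sequence is unchanged.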


27.3.3.2.2 Second Neighborhood Structure: Split-and-Redistribute 


The objective of “split-and-redistribute” is to split various blocks with a given length from the solution 
and redistribute them to the solution. The total number, length, original positions, and new positions of 
the blocks are randomly selected and all blocks have the same length. Figure 27.13 illustrates an instance 
of this neighborhood structure. Suppose that the randomly selected total number is 3, the length is 2, the 
original positions are 1, 5, and 10, and the new positions are 10, 1, and 5 for the three blocks, respectively. 
That is, the three highlighted blocks (1, 3), (3, 4), and (4, 4) are randomly chosen. These three blocks are 
split and then redistributed to their new positions to acquire the new solution. 
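
The example above can be sketched as follows. This is an illustration under a stated assumption: as in the example, the new positions are taken to be a permutation of the original positions, so that the blocks trade places among the vacated slots. The sample sequence is hypothetical apart from the three blocks named in the text.

```python
def split_and_redistribute(seq, length, positions, new_positions):
    """Move equal-length blocks from 1-based `positions` to `new_positions`.

    Assumption: `new_positions` is a permutation of `positions` and the
    blocks do not overlap, as in the example of the text.
    """
    if sorted(positions) != sorted(new_positions):
        raise ValueError("new positions must be a permutation of the originals")
    starts = sorted(positions)
    if any(b - a < length for a, b in zip(starts, starts[1:])):
        raise ValueError("blocks must not overlap")
    if starts[-1] - 1 + length > len(seq):
        raise ValueError("block runs past the end of the sequence")
    out = list(seq)
    for src, dst in zip(positions, new_positions):
        out[dst - 1:dst - 1 + length] = seq[src - 1:src - 1 + length]
    return out


# Hypothetical sequence with blocks (1, 3), (3, 4), (4, 4) at positions 1, 5, 10.
example = [1, 3, 2, 1, 3, 4, 2, 2, 1, 4, 4]
redistributed = split_and_redistribute(example, 2, [1, 5, 10], [10, 1, 5])
```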


FIGURE 27.12 An example of “cut-and-paste.”

FIGURE 27.13 An example of “split-and-redistribute.”

FIGURE 27.14 An example of “swap block.”


27.3.3.2.3 Third Neighborhood Structure: Swap Block 


“Swap block” is aimed at exchanging two different blocks with a given length in the solution. The length 
and the positions of the two blocks are randomly selected and the lengths of these two blocks are 
equivalent. Figure 27.14 portrays an example of this neighborhood structure. Assume that the randomly 
selected length is 3 and the randomly chosen positions are 4 and 8. That means the two highlighted 
blocks (2, 3, 4) and (1, 4, 4) are randomly selected. Then, these two blocks are swapped to produce the 
new solution. 
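
A short sketch of the swap follows, assuming 1-based positions and non-overlapping blocks. The sample sequence is hypothetical; only the blocks (2, 3, 4) at position 4 and (1, 4, 4) at position 8 follow the text.

```python
def swap_block(seq, length, pos_a, pos_b):
    """Exchange two equal-length, non-overlapping blocks (1-based positions)."""
    a, b = sorted((pos_a - 1, pos_b - 1))
    if a + length > b or b + length > len(seq):
        raise ValueError("blocks overlap or run past the end of the sequence")
    out = list(seq)
    # Both right-hand slices are read from the untouched input sequence.
    out[a:a + length], out[b:b + length] = seq[b:b + length], seq[a:a + length]
    return out


# Hypothetical sequence with (2, 3, 4) at position 4 and (1, 4, 4) at position 8.
example = [1, 2, 1, 2, 3, 4, 3, 1, 4, 4]
swapped = swap_block(example, length=3, pos_a=4, pos_b=8)
```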


27.4 Computational Results 


In this section, the computational results and times obtained by using the recently proposed heuristic, i.e.,
a greedy heuristic (one-phase heuristic) together with stochastic neighborhood structures (local search
heuristic), are summarized. Four different sets of benchmark problems were used to verify the effectiveness of the heuristic: (1) instances cgcut1–cgcut3 from Christofides and Whitlock (1977), (2)
instances gcut1–gcut13 and ngcut1–ngcut12 from the two Beasley studies, respectively (Beasley 1985a, b),
(3) 300 instances (named B&W1–B&W300 in this chapter) from Berkey and Wang (1987),
and (4) 200 instances (named M&V1–M&V200 in this chapter) from Martello and
Vigo (1998). Two sets of 47 and 121 real-world instances offered by furniture companies were also
adopted. Different scenarios with all the possible combinations of the following three parameters were
considered: (1) whether rotation by 90° is allowed or prohibited for each item, (2) horizontal or vertical cut(s) in the first-stage cutting, and (3) four cases: 2-stage without trimming, 2-stage with trimming,
3-stage without trimming, or 3-stage with trimming. In the following, the relative gaps between the
heuristic solutions and the optimal solutions found by an exact method based on the pseudo-polynomial
integer programming model (Silva et al. 2010), together with the averages of the percentage of waste, are given in order
to reflect the quality of the proposed heuristic. A relative gap is calculated by means of the following
equation:


$$\mathrm{GAP}\,(\%) = \frac{Z_{H} - Z_{Opt}}{Z_{Opt}} \times 100 \tag{27.2}$$

where
$Z_{H}$ is the heuristic solution
$Z_{Opt}$ is the optimal solution obtained by the exact method
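
As a small worked illustration of Equation 27.2, the sketch below computes the gap for hypothetical bin counts. The function name and the numbers 24 and 23 are assumptions made for the example and are not results from the study.

```python
def relative_gap(z_heuristic, z_optimal):
    """Relative gap in percent between a heuristic and an optimal objective value."""
    return (z_heuristic - z_optimal) / z_optimal * 100.0


# Hypothetical instance: the heuristic uses 24 bins, the exact method 23.
gap = relative_gap(24, 23)
```

A gap of 0% means that the heuristic solution is itself optimal.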


However, since optimal solutions are known only for the case of horizontal cut(s) in the first-stage cutting,
only gaps belonging to this case are reported. For details of the implementation,
please refer to Chan et al. (2009).

Tables 27.1 and 27.2 show the ranges of the gaps and the averages of the percentage of waste with all possible combinations of parameters for all benchmark and real-world problems, respectively. From Table 27.1, it can be observed that the maximum gaps are small for all scenarios (i.e., all of them are less than 5%, except for the case of 2-stage without trimming with rotation allowed in instances cgcut1–3, gcut1–13, and ngcut1–12). This implies that the heuristic solutions obtained are quite close to the optimal solutions. Some heuristic solutions found are even optimal. Also, the minimum and maximum percentages found in the two sets of real-world instances are generally larger than those of the two sets of benchmark instances because the former are harder problems. Note that since the optimal solutions cannot be attained for the case of 2-stage without trimming with rotation allowed in the first set of 47 real-world instances, the range of the relative gaps is not given in this case (shown as n/a).

TABLE 27.1 Ranges of Relative Gaps between the Heuristic Solutions and the Optimal Solutions Found with All Possible Combinations of Parameters for All Benchmark and Real-World Problems

Horizontal Cut(s) at the First Stage

Instances                                   2-Stage without    2-Stage with       3-Stage without    3-Stage with
                                            Trimming           Trimming           Trimming           Trimming
cgcut1–3, gcut1–13, ngcut1–12           NR  [0%, 0.54%]        [0%, 0%]           [0%, 0%]           [0%, 0.92%]
                                        R   [0%, 8.70%]        [3.13%, 4.35%]     [0%, 3.23%]        [0%, 3.23%]
B&W1–300, M&V1–200                      NR  [0%, 0.22%]        [0%, 1.67%]        [0%, 2.10%]        [0%, 2.82%]
                                        R   [0%, 4.83%]        [0%, 3.10%]        [0%, 0.81%]        [0%, 0.81%]
First set of 47 real-world instances    NR  [0%, 2.81%]        [0%, 4.38%]        [0%, 4.65%]        [0%, 4.65%]
                                        R   n/a                [0.93%, 1.85%]     [0.94%, 0.94%]     [0.94%, 0.94%]
Second set of 121 real-world instances  NR  [0.78%, 0.95%]     [1.66%, 3.03%]     [1.91%, 3.06%]     [1.98%, 3.09%]
                                        R   [1.09%, 4.91%]     [2.02%, 3.42%]     [2.79%, 3.92%]     [2.59%, 3.78%]

Note: NR: no rotation is allowed; R: rotation is allowed; [x, y]: x is the minimum percentage and y is the maximum percentage.

TABLE 27.2 Ranges of Averages of Percentage of Waste with All Possible Combinations of Parameters for All Benchmark and Real-World Problems

Instances                                   2-Stage without    2-Stage with       3-Stage without    3-Stage with
                                            Trimming           Trimming           Trimming           Trimming
Horizontal Cut(s) at the First Stage
cgcut1–3, gcut1–13, ngcut1–12           NR  [44.02%, 59.04%]   [28.67%, 37.89%]   [26.11%, 32.28%]   [26.11%, 30.51%]
                                        R   [31.27%, 44.38%]   [22.65%, 30.07%]   [21.26%, 28.43%]   [21.26%, 28.43%]
B&W1–300, M&V1–200                      NR  [23.69%, 79.12%]   [13.53%, 36.75%]   [12.48%, 36.75%]   [12.68%, 36.75%]
                                        R   [10.35%, 65.20%]   [8.88%, 36.43%]    [8.62%, 36.43%]    [8.62%, 36.43%]
First set of 47 real-world instances    NR  [19.54%, 37.90%]   [12.75%, 32.76%]   [12.75%, 30.40%]   [12.75%, 30.40%]
                                        R   [15.49%, 34.58%]   [10.86%, 28.59%]   [10.29%, 26.69%]   [9.02%, 26.23%]
Second set of 121 real-world instances  NR  [13.64%, 23.24%]   [11.48%, 20.45%]   [11.51%, 20.03%]   [11.44%, 20.28%]
                                        R   [12.55%, 19.36%]   [8.75%, 17.49%]    [8.68%, 17.12%]    [8.68%, 17.01%]
Vertical Cut(s) at the First Stage
cgcut1–3, gcut1–13, ngcut1–12           NR  [55.90%, 57.73%]   [30.25%, 41.78%]   [27.07%, 34.06%]   [27.95%, 34.06%]
                                        R   [34.73%, 44.38%]   [22.65%, 28.43%]   [22.65%, 28.43%]   [22.65%, 28.43%]
B&W1–300, M&V1–200                      NR  [22.52%, 79.64%]   [12.87%, 38.11%]   [12.47%, 36.75%]   [12.67%, 36.75%]
                                        R   [10.41%, 65.20%]   [8.82%, 36.43%]    [8.62%, 36.43%]    [8.68%, 36.43%]
First set of 47 real-world instances    NR  [19.36%, 52.30%]   [14.98%, 36.06%]   [12.63%, 35.76%]   [13.22%, 36.06%]
                                        R   [14.43%, 39.93%]   [10.67%, 25.35%]   [9.87%, 25.35%]    [9.55%, 25.35%]
Second set of 121 real-world instances  NR  [13.44%, 32.08%]   [13.84%, 31.39%]   [13.56%, 30.83%]   [13.66%, 31.30%]
                                        R   [11.02%, 17.09%]   [9.20%, 16.30%]    [8.64%, 16.30%]    [8.56%, 16.30%]

Note: NR: no rotation is allowed; R: rotation is allowed; [x, y]: x is the minimum percentage and y is the maximum percentage.
Moreover, it can be seen from Table 27.2 that both the minimum and maximum percentages of the
ranges of averages of the percentage of waste obtained in the case of 2-stage without trimming are larger
than those of the other cases. The reason is that this case restricts the items’ heights to be the same as the height
of the shelf in which they reside, and free spaces are wasted if no appropriate item that satisfies the constraint can be packed into these spaces. In addition, both the minimum and maximum percentages of
the ranges of averages of the percentage of waste found in the rotation case are smaller than those of the
nonrotation case because rotation gives more flexibility for fitting various items into empty spaces with different
dimensions, so that fewer empty spaces are wasted. It is noteworthy that allowing trimming in the three-stage problem reduces the waste only marginally.

Table 27.3 gives the ranges of the sum of average and the average of standard deviation of computa- 
tional times with all possible combinations of parameters for all benchmark and real-world problems. 


TABLE 27.3 Ranges of the Sum of Average and the Average of Standard Deviation of Computational Times with All Possible Combinations of Parameters for All Benchmark and Real-World Problems (Unit: s)

Instances                                   2-Stage without       2-Stage with          3-Stage without       3-Stage with
                                            Trimming              Trimming              Trimming              Trimming
Horizontal Cut(s) at the First Stage
cgcut1–3, gcut1–13, ngcut1–12           NR  avg: [0.4, 1.7]       avg: [0.3, 1.0]       avg: [0.3, 1.1]       avg: [0.3, 1.0]
                                            sd: [0.005, 0.020]    sd: [0.006, 0.007]    sd: [0.006, 0.007]    sd: [0.005, 0.006]
                                        R   avg: [0.3, 1.5]       avg: [0.3, 1.0]       avg: [0.3, 1.1]       avg: [0.3, 1.1]
                                            sd: [0.006, 0.006]    sd: [0.006, 0.007]    sd: [0.006, 0.020]    sd: [0.006, 0.017]
B&W1–300, M&V1–200                      NR  avg: [7.3, 30.0]      avg: [5.8, 20.6]      avg: [7.0, 21.3]      avg: [6.6, 21.1]
                                            sd: [0.006, 0.139]    sd: [0.007, 0.022]    sd: [0.007, 0.036]    sd: [0.007, 0.025]
                                        R   avg: [7.2, 29.1]      avg: [6.2, 21.5]      avg: [6.9, 23.0]      avg: [6.5, 22.4]
                                            sd: [0.006, 0.118]    sd: [0.007, 0.024]    sd: [0.006, 0.034]    sd: [0.007, 0.030]
First set of 47 real-world instances    NR  avg: [0.9, 32.2]      avg: [0.7, 22.8]      avg: [0.7, 27.1]      avg: [0.7, 25.7]
                                            sd: [0.006, 0.056]    sd: [0.007, 0.068]    sd: [0.006, 0.065]    sd: [0.006, 0.080]
                                        R   avg: [0.9, 35.1]      avg: [0.7, 25.5]      avg: [0.8, 30.5]      avg: [0.7, 28.7]
                                            sd: [0.006, 0.075]    sd: [0.005, 0.063]    sd: [0.007, 0.073]    sd: [0.006, 0.081]
Second set of 121 real-world instances  NR  avg: [9.6, 1099.4]    avg: [8.2, 1110.8]    avg: [8.4, 1207.1]    avg: [8.3, 1218.1]
                                            sd: [0.012, 3.053]    sd: [0.012, 3.596]    sd: [0.013, 4.838]    sd: [0.010, 4.025]
                                        R   avg: [10.3, 1316.6]   avg: [8.7, 1160.5]    avg: [9.1, 1237.9]    avg: [8.9, 1228.9]
                                            sd: [0.011, 3.173]    sd: [0.011, 3.769]    sd: [0.012, 3.617]    sd: [0.012, 3.863]
Vertical Cut(s) at the First Stage
cgcut1–3, gcut1–13, ngcut1–12           NR  avg: [0.4, 1.5]       avg: [0.3, 1.0]       avg: [0.3, 1.0]       avg: [0.3, 1.1]
                                            sd: [0.005, 0.006]    sd: [0.004, 0.006]    sd: [0.005, 0.006]    sd: [0.005, 0.006]
                                        R   avg: [0.3, 1.5]       avg: [0.2, 1.0]       avg: [0.3, 1.1]       avg: [0.3, 1.1]
                                            sd: [0.006, 0.006]    sd: [0.005, 0.006]    sd: [0.005, 0.020]    sd: [0.006, 0.006]
B&W1–300, M&V1–200                      NR  avg: [7.4, 29.3]      avg: [6.2, 20.2]      avg: [7.0, 21.8]      avg: [6.7, 20.9]
                                            sd: [0.008, 0.145]    sd: [0.007, 0.028]    sd: [0.006, 0.031]    sd: [0.007, 0.024]
                                        R   avg: [7.2, 28.7]      avg: [6.2, 21.5]      avg: [7.0, 23.0]      avg: [6.5, 22.4]
                                            sd: [0.007, 0.123]    sd: [0.007, 0.026]    sd: [0.007, 0.032]    sd: [0.006, 0.028]
First set of 47 real-world instances    NR  avg: [0.9, 25.4]      avg: [0.7, 19.3]      avg: [0.7, 23.2]      avg: [0.7, 23.3]
                                            sd: [0.007, 0.043]    sd: [0.006, 0.037]    sd: [0.007, 0.048]    sd: [0.006, 0.057]
                                        R   avg: [1.0, 31.7]      avg: [0.7, 24.5]      avg: [0.7, 28.0]      avg: [0.7, 26.7]
                                            sd: [0.007, 0.056]    sd: [0.006, 0.049]    sd: [0.006, 0.058]    sd: [0.006, 0.062]
Second set of 121 real-world instances  NR  avg: [9.3, 1024.7]    avg: [8.4, 1004.7]    avg: [8.6, 1184.8]    avg: [8.7, 1091.7]
                                            sd: [0.009, 2.376]    sd: [0.013, 2.633]    sd: [0.010, 2.777]    sd: [0.014, 2.521]
                                        R   avg: [10.1, 1123.1]   avg: [8.7, 1736.6]    avg: [9.1, 1841.4]    avg: [9.2, 1862.0]
                                            sd: [0.010, 2.378]    sd: [0.009, 2.878]    sd: [0.010, 3.220]    sd: [0.013, 3.158]

Note: NR: no rotation is allowed; R: rotation is allowed; avg: average; sd: standard deviation; [x, y]: x is the minimum time and y is the maximum time.
The sum of averages is obtained by summing, over all instances, the average computational time of 30 simulation runs.
The average of standard deviations is obtained by averaging, over all instances, the standard deviation
of the computational times of 30 simulation runs. Considering all scenarios of two- and
three-stage problems with and without trimming, and with and without rotation, in the case of horizontal cut(s) at the first stage, the sums of averages for the set of cgcut, gcut, and ngcut instances, the set of
B&W and M&V instances, the first set of 47 real-world instances, and the second set of 121 real-world
instances are within 1.7, 30, 35.1, and 1316.6 s, respectively. The first three are short because these are
easier problems (i.e., the number of items ranges from 7 to 809 in the instances). The last one is longer
since these are hard problems (i.e., the number of items ranges from 32 to 10,710 in the instances), but it
is still acceptable. Also, the averages of standard deviations are within 0.020, 0.139, 0.081, and 4.838 s,
respectively. This shows that the computational times of those instances are quite stable and consistent.
The magnitude of the sum of averages and the average of standard deviations achieved in the case of vertical cut(s) at the first stage is similar to that in the case of horizontal cut(s). This is expected because only
bin width and bin height, and item widths and item heights, are exchanged in the former case, and the
numbers of algorithmic operations implemented in the two versions are the same.
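
The two summary statistics can be sketched as follows. The function name and the timing values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_times(times_per_instance):
    """Return (sum of averages, average of standard deviations).

    times_per_instance -- one list of run times (in seconds) per instance,
                          e.g. 30 simulation runs each.
    """
    averages = [mean(runs) for runs in times_per_instance]
    deviations = [stdev(runs) for runs in times_per_instance]  # assumption: sample sd
    return sum(averages), mean(deviations)


# Hypothetical run times for two instances (not data from the study).
sum_of_avg, avg_of_sd = summarize_times([[1.0, 1.0, 1.0], [2.0, 4.0]])
```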


27.5 Conclusions 


Several variants of two-dimensional guillotine bin-packing problems were addressed. For two-stage 
problems with trimming, one-phase and two-phase heuristics were reviewed. A greedy heuristic based 
on the definition of a packing sequence of the items and on a set of criteria to pack one item was described.
Recently proposed deterministic and stochastic neighborhood structures, based on modifications on 
the current sequence of items, were presented. Since a solution is coded as a sequence and decoded by 
the constructive heuristic, the neighborhood structures are independent of the particular variant being 
addressed, which gives the approach the flexibility to deal with different variants of the problem (in 
particular, variants with two and three stages with and without trimming, with and without rotation). 

The computational tests revealed that the proposed heuristic approaches are able to find good- 
quality solutions within reasonable amounts of time for instances from the literature and for “real- 
world” instances. 

Using heuristics to address bin-packing problems in the wood-cutting industry is a robust strategy, 
because companies often face problems with different characteristics and constraints, which can be 
addressed with relatively minor amendments on the core version of the heuristic. 
28.1 Introduction


Recently, optimization techniques based on the behavior of some animal species in their natural environment
have been developed intensively. Algorithms simulating the behavior of bee colonies [1], ant colonies [2,3], and
bird flocks [4] (fish schools) have appeared. The last of these techniques is known in the
literature as the particle swarm algorithm (PSO, particle swarm optimization), and is a relatively new
technique dedicated to optimization problems with a continuous domain. However, modifications of it
for discrete problems [5] have also been developed lately. The PSO algorithm has many
features in common with evolutionary computation techniques. It operates on a randomly created population of potential solutions and searches for the optimal solution through the
creation of successive populations of solutions. Genetic operators such as crossover and mutation,
which exist in evolutionary algorithms, are not used in the PSO algorithm. Instead, potential solutions (also called particles) move toward the actual (dynamically changing) optimum
in the solution space.

There exist two versions of the PSO algorithm: local (LPSO) and global (GPSO). In the LPSO,
the process of optimization is based on changes of the velocity V of each particle $P_i$ moving toward position
$P_{best}$, which corresponds to the best position of the given particle, and $L_{best}$, which corresponds to the best
position of another particle chosen from the Ne nearest neighbors of particle $P_i$, found up to the present
step of the algorithm. In the GPSO algorithm, the process of optimization is based on changes of the velocity V
(acceleration) of each particle $P_i$ moving toward position $P_{best}$ and position $G_{best}$, which represents the
best position obtained in previous iterations of the algorithm.


The values of velocity in each of the D directions (D is equal to the number of variables in the optimized
task) of the solution space are computed for each particle $P_i$ using the formula

• For the LPSO algorithm

$$V[j] = V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (L_{best}[j] - X[j]) \tag{28.1}$$

• For the GPSO algorithm

$$V[j] = V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (G_{best}[j] - X[j]) \tag{28.2}$$

where j = 1, 2, ..., D; $c_1$ and $c_2$ are coefficients of particle acceleration, usually positive values chosen
experimentally from the range [0, 2]; $r_{1,j}$ and $r_{2,j}$ are random real numbers with uniform distribution from
the range [0, 1], which introduce a random character to the algorithm.
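
Formulas (28.1) and (28.2) differ only in the second attractor, so one function can serve both versions. The following is a minimal sketch; the function and parameter names are hypothetical, and `guide_best` stands for $L_{best}$ in the LPSO or $G_{best}$ in the GPSO.

```python
import random


def update_velocity(v, x, p_best, guide_best, c1, c2, rng=random):
    """Velocity update of formula (28.1) or (28.2) for one particle.

    guide_best -- L_best for the local version, G_best for the global version.
    rng        -- any object with a random() method returning values in [0, 1).
    """
    new_v = []
    for j in range(len(x)):
        r1, r2 = rng.random(), rng.random()   # fresh random numbers per dimension
        new_v.append(v[j]
                     + c1 * r1 * (p_best[j] - x[j])
                     + c2 * r2 * (guide_best[j] - x[j]))
    return new_v
```

When a particle sits at its own best position, the first attraction term vanishes and only the pull toward `guide_best` remains.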

The main advantage of the LPSO algorithm is that its solutions are less susceptible to being “trapped” in a local minimum
than those of the GPSO algorithm. This advantage is a consequence of the higher spread of solution
values in the population in the LPSO algorithm (a larger part of the solution space is considered). However, the
main advantage of the GPSO algorithm is faster convergence than in the case of the LPSO algorithm, which is caused
by the smaller spread of solution values in the population (a smaller part of the solution space is covered). These
two versions of the PSO algorithm have been successfully applied in different disciplines of science in
recent years [6,7]. This chapter is only an introduction to the problems connected with PSO.


28.2 Particle Swarm Optimization Algorithm 


Generally, the PSO algorithm can be described using the following six parts.

In the first part, the objective function and the values of all algorithm parameters, such as M (number of
particles in the population), $c_1$, $c_2$, and D, are determined. Additionally, for the LPSO algorithm, an Ne value,
which determines the number of nearest neighbors for each particle, is chosen. After parameter
selection, the population P, which consists of M particles (potential solutions), is randomly created. In the
GPSO algorithm, the vector $G_{best}$ is also prepared. In this vector, the data representing the best position of a particle (solution) found during algorithm operation are written down (at the start of the algorithm, vector [$G_{best}$] = [0]). Each particle $P_i$ (i ∈ [1, M]) from population P is composed of the following
D-element vectors. Vector X represents the current position of particle $P_i$ (a solution of the given problem);
vector $P_{best}$ represents the best position of particle $P_i$ in the solution space found for this particle up to
a given step of the algorithm; vector $L_{best}$ (only for LPSO) represents the best position of another particle
from among the Ne nearest neighbors of particle $P_i$, which has been obtained during previous iterations
of the algorithm; and vector V represents the values of velocity of particle $P_i$ in each of the D directions of the
solution space. At the start of the algorithm, the vector X is assigned to the vector $P_{best}$, and the values of
vectors V and $L_{best}$ are cleared during creation of the initial population.

In the second part, each particle $P_i$ is evaluated using the objective function. In the case of minimization
tasks, when the computed value of the objective function for the data written down in vector X of particle $P_i$ is lower
than the best value found for this particle up to a given step of the algorithm (the value written down in
vector $P_{best}$), then the values from vector X are written down in the vector $P_{best}$. In the case of maximization tasks, the values stored in vector X are written down in vector $P_{best}$ if and only if the computed
value of the objective function for the data from vector X of particle $P_i$ is higher than the best value found
for this particle up to a given step of the algorithm. In the GPSO algorithm, updating of the vector $G_{best}$ is
performed after evaluation of all particles $P_i$ in the population P. If there exists a particle $P_i$ in the population
having a lower value of the objective function (for minimization tasks) or a higher value of the objective
function (for maximization tasks) than the value of the objective function for the solution written
down in the vector $G_{best}$, then the position of particle $P_i$ is written down in the vector $G_{best}$.
Procedure Particle Swarm Optimization Algorithm
Begin
  Determine objective function and algorithm parameters
  Randomly create swarm (population) P composed of M particles
  Repeat
    For each particle i = 1, ..., M do
    Begin
      Evaluate the quality of particle position X (solution) using objective function
      Determine the best position X obtained until the current step of the algorithm and write down this position in vector Pbest
      //--- for local version - algorithm LPSO
      Update the position written down in vector Lbest if a better position has been found among the Ne nearest neighbours of the current particle
      //--- for global version - algorithm GPSO
      Update the position written down in vector Gbest if the position of the current particle is better
    End
    For each particle i = 1, ..., M do
    Begin
      Randomly determine the values of the r1,j and r2,j coefficients
      //--- for local version - algorithm LPSO
      Update particle velocity V using formula (28.1)
      //--- for global version - algorithm GPSO
      Update particle velocity V using formula (28.2)
      //--- for both versions - algorithms LPSO and GPSO
      Update particle position X using formula (28.3)
    End
  Until termination condition is fulfilled
End

FIGURE 28.1 Particle swarm optimization algorithm in pseudo-code form. (Adapted from Engelbrecht, A. P.,
Computational Intelligence—An Introduction, 2nd edn., Wiley, Chichester, U.K., 2007.)


In the third part (only for LPSO), the best particle among the Ne nearest neighbors of particle $P_i$ is
determined for each particle $P_i$. The neighborhood is determined based on the positions of the particles
in the solution space (e.g., using the Euclidean distance). The position of the best neighboring particle
with respect to particle $P_i$ is written down in the vector $L_{best}$ of particle $P_i$ if and only if this position
is better than the position (solution) currently stored in vector $L_{best}$.

In the fourth part, the values of velocity in each of the D directions of the solution space are computed for each
particle $P_i$ using formula (28.1) for LPSO and formula (28.2) for GPSO.

In the fifth part, the vector X containing the position of particle $P_i$ in the D-dimensional solution space is
updated for each particle $P_i$ using the velocity vector V according to the formula

$$X[j] = X[j] + V[j] \tag{28.3}$$

In the sixth part, the termination condition of the algorithm is checked. The termination
condition can be algorithm convergence (invariability of the best solution during a prescribed
number of iterations) or an assumed number of generations. If the termination condition is fulfilled, then
the solution written down in vector X of the particle having the lowest value of the objective function
(in the case of minimization tasks) or the particle having the highest value of the objective function (in the
case of maximization tasks) is returned as the result of the LPSO algorithm. In the case of the GPSO algorithm, the solution written
down in vector $G_{best}$ is the result. When the termination condition is not fulfilled, the algorithm returns to the second part.
In Figure 28.1, the PSO algorithm is presented in pseudo-code form.
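
The six parts can be put together in a compact sketch of the global version for minimization. This is an illustration under stated assumptions, not a reference implementation: the names `gpso` and `sphere` are hypothetical, termination is a fixed number of generations, and $G_{best}$ is initialized from the first evaluation of the swarm rather than from a zero vector.

```python
import random


def sphere(x):
    """Sum of squares, used here only as a sample objective function."""
    return sum(v * v for v in x)


def gpso(objective, dim, bounds, m=5, c1=0.3, c2=0.3, iterations=200, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    # Part 1: random swarm; V is cleared and P_best is set to X.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(m)]
    vs = [[0.0] * dim for _ in range(m)]
    p_best = [list(x) for x in xs]
    p_val = [objective(x) for x in xs]
    g_val = min(p_val)
    g_best = list(p_best[p_val.index(g_val)])
    for _ in range(iterations):          # Part 6: fixed number of generations
        for i in range(m):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Part 4: velocity update, formula (28.2)
                vs[i][j] += (c1 * r1 * (p_best[i][j] - xs[i][j])
                             + c2 * r2 * (g_best[j] - xs[i][j]))
                # Part 5: position update, formula (28.3)
                xs[i][j] += vs[i][j]
        for i in range(m):               # Part 2: evaluation and best-position updates
            val = objective(xs[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = list(xs[i]), val
            if val < g_val:
                g_best, g_val = list(xs[i]), val
    return g_best, g_val


best, best_value = gpso(sphere, dim=2, bounds=(-5.12, 5.12))
```

Because $G_{best}$ is only ever replaced by a strictly better position, the returned value can never be worse than the best value in the initial swarm.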


28.3 Modifications of PSO Algorithm 


There exist many modifications of PSO algorithm, improving its convergence. Among these modifications, 
we can mention velocity clamping, inertia weight, and constriction coefficient. 


© 2011 by Taylor and Francis Group, LLC 




28.3.1 Velocity Clamping 


During research on the PSO algorithm, it was noticed that the update of the velocity vector V using formula
(28.1) or formula (28.2) causes a fast increase of the values stored in this vector. As a consequence, the particle positions change by increasingly large amounts during algorithm operation. Due to these
changes, new positions of particles can be located outside of the acceptable region of a given search
space. In order to limit this drawback, constraint values of particle velocity $V_{max,j}$ are introduced in each
jth dimension of the search space. When the computed value of the velocity $V_{i,j}$ for the ith particle in the jth
dimension is higher than the value $V_{max,j}$, then the following formula is used:

$$V_{i,j} = V_{max,j} \tag{28.4}$$

Of course, it is important to choose suitable values of $V_{max,j}$. For small values of $V_{max,j}$, the algorithm convergence time will be higher, and the swarm can get stuck in a local extreme without any chance
of escaping from this region of the search space. However, if the values of $V_{max,j}$ are too high, then
the particles can “jump” over good solutions and continue the search in worse areas of the search
space [8]. Usually, the following formula (28.5) is used in order to guarantee a suitable selection of $V_{max,j}$
values for the jth variable:

$$V_{max,j} = \delta \cdot (X_{max,j} - X_{min,j}) \tag{28.5}$$

where $X_{max,j}$ and $X_{min,j}$ are the maximum and minimum values, respectively, which determine the range of the
jth variable ($X[j] \in [X_{min,j}, X_{max,j}]$), and δ is a value from the range (0; 1), which is determined experimentally
for the solved problem.
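
Formulas (28.4) and (28.5) can be sketched as follows. The function names are hypothetical, and one detail goes beyond the text: velocities below $-V_{max,j}$ are also clamped, which is a common convention and is labeled as an assumption in the code.

```python
def velocity_limits(delta, x_min, x_max):
    """Formula (28.5): V_max,j = delta * (X_max,j - X_min,j) for each dimension."""
    return [delta * (hi - lo) for lo, hi in zip(x_min, x_max)]


def clamp_velocity(v, v_max):
    """Formula (28.4), applied per dimension.

    Assumption: the limit is applied symmetrically, so values below
    -V_max,j are set to -V_max,j as well.
    """
    return [max(-limit, min(limit, vj)) for vj, limit in zip(v, v_max)]


# Search range [-5.12, 5.12] in two dimensions with a hypothetical delta of 0.1.
limits = velocity_limits(0.1, [-5.12, -5.12], [5.12, 5.12])
clamped = clamp_velocity([3.0, -2.0], limits)
```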


28.3.2 Inertia Weight 


The inertia weight is introduced to the PSO algorithm in order to better control the particle swarm's ability
of exploration (searching over the whole solution space) and exploitation (searching in the neighborhood of “good” solutions). The inertia weight coefficient ω determines how large a part of the particle velocity from the previous “fly” will be used to create its new velocity vector. The introduction of the ω coefficient
causes the following modifications of formulas (28.1) and (28.2), respectively:

$$V[j] = \omega \cdot V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (L_{best}[j] - X[j]) \tag{28.6}$$

$$V[j] = \omega \cdot V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (G_{best}[j] - X[j]) \tag{28.7}$$

Many different techniques for determining ω values exist in the literature [8]. One of them is described
by the following formula:

$$\omega > \frac{1}{2}(c_1 + c_2) - 1 \tag{28.8}$$

In this case, the choice of the ω value depends on the selection of the $c_1$ and $c_2$ values. In paper [9], it is shown
that if formula (28.8) is fulfilled, then the algorithm will converge; otherwise, oscillations of the
obtained results can occur. Other methods of determining the ω value can be found, for example, in
paper [8].
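
Condition (28.8) and the modified update can be sketched as follows. The function names are hypothetical; `guide_best` again stands for $L_{best}$ in (28.6) or $G_{best}$ in (28.7), and the random numbers are passed in explicitly to keep the example deterministic.

```python
def satisfies_inertia_condition(w, c1, c2):
    """Check formula (28.8): w > (c1 + c2) / 2 - 1."""
    return w > 0.5 * (c1 + c2) - 1.0


def update_velocity_with_inertia(v, x, p_best, guide_best, w, c1, c2, r1, r2):
    """Formula (28.6) or (28.7); r1 and r2 hold one random number per dimension."""
    return [w * v[j]
            + c1 * r1[j] * (p_best[j] - x[j])
            + c2 * r2[j] * (guide_best[j] - x[j])
            for j in range(len(v))]
```

For example, with $c_1 = c_2 = 0.3$ the right-hand side of (28.8) is negative, so any nonnegative ω satisfies the condition.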
28.3.3 Constriction Coefficient


In its operation, the constriction coefficient χ is similar to the inertia weight described in Section 28.3.2.
The main task of the constriction coefficient is to balance the PSO algorithm between global
searching of the solution space and local searching in the neighborhood of “good” solutions. The modified formulas
using the constriction coefficient χ are as follows:

$$V[j] = \chi \cdot \left[ V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (G_{best}[j] - X[j]) \right] \tag{28.9}$$

$$V[j] = \chi \cdot \left[ V[j] + c_1 \cdot r_{1,j} \cdot (P_{best}[j] - X[j]) + c_2 \cdot r_{2,j} \cdot (L_{best}[j] - X[j]) \right] \tag{28.10}$$

and

$$\chi = \frac{2 \cdot \kappa}{\left| 2 - \varphi - \sqrt{\varphi \cdot (\varphi - 4)} \right|} \tag{28.11}$$

where $\varphi = c_1 \cdot r_1 + c_2 \cdot r_2$; usually, it is assumed that $\varphi \ge 4$ and $\kappa \in [0, 1]$.

Due to the application of the constriction coefficient χ, convergence of the PSO algorithm is assured without
the need to use the velocity clamping model. The χ coefficient values are chosen from the range
[0, 1] and, therefore, the particle velocities decrease during each iteration of the algorithm. The
value of the parameter κ influences the swarm's ability for global or local searching of the solution space.
If κ ≈ 0, then faster algorithm convergence occurs together with local searching (this behavior is similar
to the hill-climbing algorithm). In the case when κ ≈ 1, the algorithm convergence is slower, and at the
same time, searching of the solution space [8] is more exact.
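
Formula (28.11) is easy to evaluate numerically. The sketch below uses a hypothetical function name and the sample value φ = 4.1, which is an illustrative choice and not a value taken from the chapter.

```python
import math


def constriction_coefficient(phi, kappa=1.0):
    """Formula (28.11): chi = 2*kappa / |2 - phi - sqrt(phi * (phi - 4))|."""
    if phi < 4.0:
        raise ValueError("formula (28.11) assumes phi >= 4")
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa is assumed to lie in [0, 1]")
    return 2.0 * kappa / abs(2.0 - phi - math.sqrt(phi * (phi - 4.0)))


chi = constriction_coefficient(4.1, kappa=1.0)   # roughly 0.73
```

Since χ scales linearly with κ, smaller κ values shrink every velocity more strongly, which matches the faster, more local convergence described above.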


28.4 Example 


Minimize the following function:

$$FC = f(x_1, x_2) = \sum_{i=1}^{2} x_i^2, \quad -5.12 \le x_i \le 5.12$$

Global minimum = 0 at $(x_1, x_2) = (0, 0)$.

It is assumed that the PSO algorithm has the following parameters: number of particles M = 5; $c_1 = c_2 = 0.3$.
The dimension of the solution space (identical to the dimension of the optimized function) is equal to D = 2.
Two versions of the PSO algorithm are considered: local and global. Additionally, in LPSO, the number
of nearest neighbors Ne = 2 is assumed.


28.4.1 Random Creation of the Population P Consisting of M Particles
28.4.1.1 For LPSO

It is assumed that the particles have the following parameters:

Particle $P_1$: X = {3.12; 4.01}; $P_{best}$ = X = {3.12; 4.01}; V = {0; 0}; $L_{best}$ = {0; 0}

Particle $P_2$: X = {−2.89; −1.98}; $P_{best}$ = X = {−2.89; −1.98}; V = {0; 0}; $L_{best}$ = {0; 0}

Particle $P_3$: X = {4.32; 2.11}; $P_{best}$ = X = {4.32; 2.11}; V = {0; 0}; $L_{best}$ = {0; 0}

Particle $P_4$: X = {2.11; −2.12}; $P_{best}$ = X = {2.11; −2.12}; V = {0; 0}; $L_{best}$ = {0; 0}

Particle $P_5$: X = {0.11; −2.71}; $P_{best}$ = X = {0.11; −2.71}; V = {0; 0}; $L_{best}$ = {0; 0}
28.4.1.2 For GPSO


It is assumed that the created population is composed of the same particles as in the LPSO algorithm.
The difference is that the vector $L_{best}$ does not exist in the created particles $P_i$; instead, there exists the
vector $G_{best}$.


28.4.2 Evaluation of Particle Positions Using Objective Function FC 
28.4.2.1 For LPSO 


The quality of the positions is computed using the objective function FC and the data stored in the vector X of the
particular particles. The $FC_i$ values are

$FC_1$ = 25.8145; $FC_2$ = 12.2725; $FC_3$ = 23.1145; $FC_4$ = 8.9465; $FC_5$ = 7.3562

Since it is the first iteration of the algorithm, there is no position X of any particle $P_i$ that has a lower value
of the objective function than the position written down in its vector $P_{best}$. As a result, none of the vectors
$P_{best}$ will be updated. Of course, this update will occur in the next steps of the algorithm.


28.4.2.2 For GPSO 


The position X having the lowest value of the objective function FC (minimization task) is chosen and
written down in the vector $G_{best}$. After this operation, the vector is equal to $G_{best}$ = $X_5$ = {0.11;
−2.71}. In the next generations, the vector $G_{best}$ will be updated only if a particle in the population has
a better position X than the position stored in $G_{best}$.
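
The evaluation step of the example can be checked with a few lines of code. The positions are those assumed in Section 28.4.1; the variable names are hypothetical.

```python
# Positions X of particles P1..P5 from Section 28.4.1.
positions = {
    1: (3.12, 4.01),
    2: (-2.89, -1.98),
    3: (4.32, 2.11),
    4: (2.11, -2.12),
    5: (0.11, -2.71),
}


def fc(x):
    """Objective function of the example: sum of squared coordinates."""
    return sum(v * v for v in x)


fc_values = {i: fc(x) for i, x in positions.items()}

# GPSO: the position with the lowest FC value becomes G_best.
best_particle = min(fc_values, key=fc_values.get)
g_best = positions[best_particle]
```

Rounded to four decimals, `fc_values` reproduces the $FC_i$ values listed above, and `g_best` is the position of particle $P_5$.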


28.4.3 Calculation of the Best Neighbors (Only for LPSO Algorithm) 


At the beginning of the operation of the algorithm, the vectors L_best of all particles are updated,
because at the start all vectors [L_best] = [0]. In the next generations, the vector L_best of the
particle P_i will be updated only if the position X of the best particle among the Ne nearest neighbors
of particle P_i has a lower value of the objective function than the position X currently stored in the
vector L_best of particle P_i.

In order to determine the Ne = 2 nearest neighbors of each particle P_i, the distances between all
particles in the population must be computed. The Euclidean distance d(A, B) between two points
A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n} in n-dimensional space is determined as follows:


$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (28.12)$$


For example, the distance between the positions X of particles P_1 and P_2 is equal to


$$d(P_1, P_2) = \sqrt{(3.12 - (-2.89))^2 + (4.01 - (-1.98))^2} = \sqrt{36.1201 + 35.8801} = 8.4853$$


In a similar way, all distances between particles are obtained:


d(P_1, P_2) = 8.4853; d(P_1, P_3) = 2.2472; d(P_1, P_4) = 6.2126; d(P_1, P_5) = 7.3633;
d(P_2, P_3) = 8.2893; d(P_2, P_4) = 5.0020; d(P_2, P_5) = 3.0875; d(P_3, P_4) = 4.7725;
d(P_3, P_5) = 6.3997; d(P_4, P_5) = 2.0852.
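The distance table can be checked with a few lines of Python. This is only a sketch of formula (28.12) applied to the five initial positions; the dictionary layout and the name `distance` are illustrative choices.

```python
import math
from itertools import combinations

# Initial positions X of particles P_1 ... P_5 from the example.
positions = {1: (3.12, 4.01), 2: (-2.89, -1.98), 3: (4.32, 2.11),
             4: (2.11, -2.12), 5: (0.11, -2.71)}

def distance(a, b):
    """Euclidean distance between two points of equal dimension, formula (28.12)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# One entry per unordered pair (i, j) with i < j.
distances = {(i, j): distance(positions[i], positions[j])
             for i, j in combinations(positions, 2)}

for (i, j), d in distances.items():
    print(f"d(P_{i}, P_{j}) = {d:.4f}")
```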




After the analysis of distances between particles, the Ne = 2 nearest neighbors are determined for each
particle. For example, the neighbors of particle P_1 are particles P_3 and P_4, which is written as
P_1 => {P_3, P_4}. The two neighbor particles for the remaining particles P_i are as follows:
P_2 => {P_4, P_5}; P_3 => {P_1, P_4}; P_4 => {P_3, P_5}; P_5 => {P_2, P_4}.

Next, from each pair of neighbors of the particle P_i, the particle having the lower value of the
objective function is chosen; then, the position X of the chosen particle is written down in the vector
L_best of particle P_i. For example, the vector L_best for particle P_1 is updated using the position
vector X of particle P_4. If we perform the same computations for the other cases, we obtain the
following values of the vector L_best for all particles:


P_1: L_best = {2.11; -2.12}; P_2: L_best = {0.11; -2.71}; P_3: L_best = {2.11; -2.12};


P_4: L_best = {0.11; -2.71}; P_5: L_best = {2.11; -2.12}
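The neighbor search and the first L_best assignment can be sketched as below. The objective FC(x) = x1^2 + x2^2 is an assumption consistent with the FC values of this example, and the function name `nearest_neighbors` is hypothetical. The sketch covers only the first iteration, where every L_best is overwritten; in later iterations the stored L_best would be replaced only by a better neighbor position.

```python
import math

positions = {1: (3.12, 4.01), 2: (-2.89, -1.98), 3: (4.32, 2.11),
             4: (2.11, -2.12), 5: (0.11, -2.71)}

def fc(x):
    # Assumed objective function (reproduces the FC values of the example).
    return sum(v * v for v in x)

def nearest_neighbors(i, ne=2):
    """Indices of the ne particles closest to particle i (Euclidean distance)."""
    others = [j for j in positions if j != i]
    others.sort(key=lambda j: math.dist(positions[i], positions[j]))
    return sorted(others[:ne])

neighbors = {i: nearest_neighbors(i) for i in positions}

# L_best of particle i: position of its neighbor with the lowest objective value.
lbest = {i: positions[min(neighbors[i], key=lambda j: fc(positions[j]))]
         for i in positions}

for i in positions:
    print(f"P_{i} => {neighbors[i]}, L_best = {lbest[i]}")
```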


28.4.4 Calculation of New Values of Particle Velocity 
28.4.4.1 For LPSO 


The velocity vectors are computed for each particle using formula (28.1). For example, for particle P_1,
the computations are as follows:


V[1] = V[1] + 0.3 · r_{1,1} · (P_best[1] - X[1]) + 0.3 · r_{2,1} · (L_best[1] - X[1])


V[2] = V[2] + 0.3 · r_{1,2} · (P_best[2] - X[2]) + 0.3 · r_{2,2} · (L_best[2] - X[2])


If it is assumed that the following real numbers are randomly chosen for the coefficients r_{i,j}:
r_{1,1} = 0.22, r_{2,1} = 0.76, r_{1,2} = 0.55, r_{2,2} = 0.21, we obtain

V[1] = 0 + 0.3 · 0.22 · (3.12 - 3.12) + 0.3 · 0.76 · (2.11 - 3.12) = -0.2303
V[2] = 0 + 0.3 · 0.55 · (4.01 - 4.01) + 0.3 · 0.21 · (-2.12 - 4.01) = -0.3862

If the same computations are performed for the remaining particles and, for simplification, the same
set of random numbers is assumed for the coefficients r_{i,j} (r_{1,1} = 0.22, r_{2,1} = 0.76,
r_{1,2} = 0.55, r_{2,2} = 0.21), the following vectors are obtained:


P_1: V = {-0.2303; -0.3862}; P_2: V = {0.6840; -0.0460}; P_3: V = {-0.5039; -0.2665}


P_4: V = {-0.4560; -0.0372}; P_5: V = {0.4560; 0.0372}
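The update for particle P_1 can be reproduced with a small Python function. It mirrors the two worked expressions above (both acceleration coefficients equal to 0.3), not necessarily every term of formula (28.1); the function name and argument layout are illustrative assumptions.

```python
def update_velocity(v, x, pbest, best, r1, r2, c1=0.3, c2=0.3):
    """Return the new velocity vector.

    v, x, pbest, best: per-dimension sequences (best is L_best for LPSO).
    r1[k], r2[k]: random factors of the first and second term for dimension k.
    """
    return [v[k] + c1 * r1[k] * (pbest[k] - x[k]) + c2 * r2[k] * (best[k] - x[k])
            for k in range(len(x))]

r1 = (0.22, 0.55)  # r_{1,1}, r_{1,2}
r2 = (0.76, 0.21)  # r_{2,1}, r_{2,2}

# Particle P_1: X = P_best = {3.12; 4.01}, L_best = {2.11; -2.12}, V = {0; 0}
v1 = update_velocity([0.0, 0.0], (3.12, 4.01), (3.12, 4.01), (2.11, -2.12), r1, r2)
print([round(c, 4) for c in v1])
```

Since X equals P_best in the first iteration, the first term vanishes and the velocity is driven only by the difference between L_best and X.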


28.4.4.2 For GPSO 


The velocity vector is updated using formula (28.2) for each particle P_i. For example, for particle
P_1, the new velocity vector is computed as follows:


V[1] = V[1] + 0.3 · r_{1,1} · (P_best[1] - X[1]) + 0.3 · r_{2,1} · (G_best[1] - X[1])


V[2] = V[2] + 0.3 · r_{1,2} · (P_best[2] - X[2]) + 0.3 · r_{2,2} · (G_best[2] - X[2])

If it is assumed that the following real numbers have been randomly chosen for the coefficients
r_{i,j}: r_{1,1} = 0.85, r_{2,1} = 0.23, r_{1,2} = 0.45, r_{2,2} = 0.63, we obtain

V[1] = 0 + 0.3 · 0.85 · (3.12 - 3.12) + 0.3 · 0.23 · (0.11 - 3.12) = -0.2077
V[2] = 0 + 0.3 · 0.45 · (4.01 - 4.01) + 0.3 · 0.63 · (-2.71 - 4.01) = -1.2701

The velocity vectors of all particles are as follows:

P_1: V = {-0.2077; -1.2701}; P_2: V = {0.2070; -0.1380}; P_3: V = {-0.2905; -0.9110}
P_4: V = {-0.1380; -0.1115}; P_5: V = {0; 0}


Identical values of r_{1,1}, r_{2,1}, r_{1,2}, r_{2,2} are assumed in order to simplify this example; in
the real algorithm, new values of the parameters r_{1,1}, r_{2,1}, r_{1,2}, r_{2,2} must be randomly
chosen for each particle in each iteration of the algorithm.

28.4.5 Calculation of New Values of Particle Position Vectors 


New values of the position vectors X of the particles P_i are computed using formula (28.3) for both
versions of the PSO algorithm: LPSO and GPSO.


28.4.5.1 For LPSO 


For LPSO, new values of the position vectors X of the particular particles P_i are computed using the
velocity vectors V computed earlier. For example, the new position X for particle P_1 is obtained from


X[1] = X[1] + V[1]; X[2] = X[2] + V[2];


therefore, the new values are


X[1] = 3.12 + (-0.2303) = 2.8897; X[2] = 4.01 + (-0.3862) = 3.6238.


Thus, the new position vector X for particle P_1 is equal to X = {2.8897; 3.6238}.
The new position vectors X of all particles are as follows:
P_1: X = {2.8897; 3.6238}; P_2: X = {-2.2060; -2.0260}; P_3: X = {3.8161; 1.8435};


P_4: X = {1.6540; -2.1572}; P_5: X = {0.5660; -2.6728}


28.4.5.2 For GPSO 


For the GPSO version of the algorithm, the new values of the position vectors X of the particular
particles P_i are as follows:
P_1: X = {2.9123; 2.7399}; P_2: X = {-2.6830; -2.1180}; P_3: X = {4.0295; 1.1990};


P_4: X = {1.9720; -2.2315}; P_5: X = {0.1100; -2.7100}
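As a cross-check, one full GPSO step (velocity update followed by position update) can be sketched in Python. The objective FC(x) = x1^2 + x2^2 and the shared random factors are assumptions taken from this worked example; variable names are illustrative.

```python
positions = [(3.12, 4.01), (-2.89, -1.98), (4.32, 2.11), (2.11, -2.12), (0.11, -2.71)]
velocities = [(0.0, 0.0)] * len(positions)
pbest = list(positions)

def fc(x):
    # Assumed objective function (reproduces the FC values of the example).
    return sum(v * v for v in x)

gbest = min(positions, key=fc)      # best position in the whole swarm
C = 0.3                             # both acceleration coefficients in the example
r1 = (0.85, 0.45)                   # r_{1,1}, r_{1,2}
r2 = (0.23, 0.63)                   # r_{2,1}, r_{2,2}

new_v, new_x = [], []
for x, v, p in zip(positions, velocities, pbest):
    vel = tuple(v[k] + C * r1[k] * (p[k] - x[k]) + C * r2[k] * (gbest[k] - x[k])
                for k in range(2))
    new_v.append(vel)
    new_x.append(tuple(x[k] + vel[k] for k in range(2)))  # X = X + V

for i, x in enumerate(new_x, start=1):
    print(f"P_{i}: X = ({x[0]:.4f}; {x[1]:.4f})")
```

Particle P_5 holds G_best itself, so both difference terms are zero and it does not move in this iteration.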


In summary, the population of particles for the LPSO version of the algorithm after the first iteration
is as follows:


Particle P_1: X = {2.89; 3.62}; P_best = {3.12; 4.01}; V = {-0.23; -0.39}; L_best = {2.11; -2.12}
Particle P_2: X = {-2.21; -2.03}; P_best = {-2.89; -1.98}; V = {0.68; -0.05}; L_best = {0.11; -2.71}
Particle P_3: X = {3.82; 1.84}; P_best = {4.32; 2.11}; V = {-0.50; -0.27}; L_best = {2.11; -2.12}
Particle P_4: X = {1.65; -2.16}; P_best = {2.11; -2.12}; V = {-0.46; -0.04}; L_best = {0.11; -2.71}
Particle P_5: X = {0.57; -2.67}; P_best = {0.11; -2.71}; V = {0.46; 0.04}; L_best = {2.11; -2.12}
If we compute the values of the objective function FC for the data stored in the X vectors of all
particles, it can be noticed that the best result is written down in the vector X of particle P_4, for
which the value of the objective function FC is the lowest and equal to FC_4 = 7.3881.

Similarly, the population of particles for the GPSO version of the algorithm after the first iteration
is as follows:


Particle P_1: X = {2.91; 2.74}; P_best = {3.12; 4.01}; V = {-0.21; -1.27};
Particle P_2: X = {-2.68; -2.12}; P_best = {-2.89; -1.98}; V = {0.21; -0.14};
Particle P_3: X = {4.03; 1.20}; P_best = {4.32; 2.11}; V = {-0.29; -0.91};
Particle P_4: X = {1.97; -2.23}; P_best = {2.11; -2.12}; V = {-0.14; -0.11};
Particle P_5: X = {0.11; -2.71}; P_best = {0.11; -2.71}; V = {0; 0};

Vector G_best = {0.11; -2.71};


In the last step, the termination condition of the algorithm is checked. When the termination
condition is fulfilled, the LPSO algorithm returns the result (solution) having the lowest value of the
objective function in the current population of particles, while the GPSO algorithm returns the result
stored in the vector G_best. If the termination condition is not fulfilled, the algorithm jumps back to
the evaluation of particle positions in the population (Section 28.4.2), and the whole process is
repeated.
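The complete loop for the GPSO version can be sketched as follows. Several points are assumptions rather than part of the example: the objective FC(x) = x1^2 + x2^2, a termination condition based on a fixed number of iterations (`max_iter`), and the seed of the random generator. Unlike the simplified hand calculation, fresh random factors are drawn for each particle and dimension in each iteration.

```python
import random

def fc(x):
    # Assumed objective function (minimization).
    return sum(v * v for v in x)

def gpso(start_positions, max_iter=50, c1=0.3, c2=0.3, seed=0):
    rng = random.Random(seed)
    xs = [list(p) for p in start_positions]
    vs = [[0.0] * len(p) for p in start_positions]
    pbest = [list(p) for p in start_positions]
    gbest = list(min(xs, key=fc))
    for _ in range(max_iter):                 # assumed termination: iteration limit
        for x, v, p in zip(xs, vs, pbest):
            for k in range(len(x)):
                v[k] += (c1 * rng.random() * (p[k] - x[k])
                         + c2 * rng.random() * (gbest[k] - x[k]))
                x[k] += v[k]                  # position update, X = X + V
        for i, x in enumerate(xs):            # evaluation of the new positions
            if fc(x) < fc(pbest[i]):
                pbest[i] = list(x)
            if fc(x) < fc(gbest):
                gbest = list(x)
    return gbest                              # GPSO returns the content of G_best

start = [(3.12, 4.01), (-2.89, -1.98), (4.32, 2.11), (2.11, -2.12), (0.11, -2.71)]
result = gpso(start)
print(result, fc(result))
```

Because G_best is replaced only by strictly better positions, the returned solution is never worse than the best particle of the initial population.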


28.5 Summary 


In this chapter, a basic version of the PSO algorithm is presented. Two versions of this algorithm,
LPSO and GPSO, are described. Also, some modifications of the PSO algorithm that improve its
convergence are shown. The operation of the PSO algorithm is illustrated in detail by examples of
minimization of two-variable functions.
29.1 Introduction


In the real world, engineers often need to solve difficult optimization problems, such as the design
of digital filters with non-standard characteristics, floor planning of 2D elements, decomposition of
a digital circuit into subcircuits, etc. The majority of these problems are classified as NP-hard
problems, in which the solution space is very large and algorithms that can find an optimal solution
in acceptable computational time do not exist. In these cases, engineers must use heuristic methods,
which do not guarantee that the solution found will be optimal. However, these heuristic methods can
find acceptable suboptimal solutions within the required computational time. Among these methods, a
special place is reserved for evolutionary algorithms [1,2]; their main advantages are as follows: the
computation starts from a population (many points) of potential solutions (not from a single potential
solution), and only a proper objective function describing the given problem is required (no other
information, such as derivatives of the objective function, is needed). The existence of different
genetic operators (like mutation and cross-over) in these algorithms allows easy "escape" from a local
minimum found during computations and prevents premature convergence of the algorithm. Evolutionary
computation is widely used in real-world applications, such as digital filter design [3], design and
optimization of digital circuits [4,5], partitioning of VLSI circuits into subcircuits with a minimal
number of external connections between them [6], optimization of placement of 2D elements [7], training
of artificial neural networks, and optimization of parameters in the grinding process.


29.2 Description of Evolutionary Algorithms 


Evolutionary algorithms are algorithms in which the way of finding a solution, i.e., of searching the
space of potential solutions (the way of information processing), is based on the natural evolution
process and Darwin's theory of natural selection (only individuals with the best fitness to the
environment will survive in the next generation). The notions used to describe parameters and
processes in evolutionary algorithms are closely related to genetics and natural evolution.
In general, the evolutionary algorithm processes the population of P individuals, each of which is
also named a chromosome and represents one potential solution of the given problem. The evolutionary
algorithm operates in an artificial environment, which can be defined based on the problem


[Figure 29.1 panels: (a) binary representation, genotype and the corresponding phenotype;
(b) real-number representation, genotype = phenotype: 22.85 | 10.43 | 32.15 | 66.21 | 11.34]


FIGURE 29.1 Example of individual with representation: binary (a) and real-number (b). 


solved by this algorithm. During the operation of the algorithm, individuals are evaluated, and those
that fit the environment better obtain a better so-called "fitness value." This value is the main
factor of evaluation. Each individual consists of coded information named a genotype. The phenotypes,
which are the decoded form of potential solutions of a given problem, are created from genotypes and
are evaluated using the fitness function. In some kinds of evolutionary algorithms, the phenotypes are
identical to the genotypes. A genotype is a point in the space of codes, while the phenotype is a point
in the space of problem solutions. Each chromosome consists of elementary units named genes (in
Figure 29.1, both individuals have five genes). Additionally, the values of a particular gene are
called alleles (e.g., in the case of binary representation, the allowed allele values are 0 and 1). In
Figure 29.1a, an example of an individual with binary representation (genotype and phenotype) is
presented, and in Figure 29.1b, an individual with real-number representation is shown (the phenotype
is identical to the genotype).

The environment can be represented using fitness function, which is related to the objective function. 
The structure of evolutionary algorithm is shown in Figure 29.2. 

It can be seen that the evolutionary algorithm is a probabilistic one, in which a new population of
individuals P(t) = {x_1^t, ..., x_n^t} is generated in each iteration t. Each individual x_i^t
represents a possible solution of the considered task and is most often represented by a data
structure in the form of a single-layer chromosome [2] (however, in paper [6], and especially in [8], a
concept of multilayer chromosome is introduced together with its potential applications). Each
solution x_i^t is evaluated using a certain measure of fitness of the chromosome. Then, the new
population P(t + 1) (in iteration t + 1) is created by selection of the best fitted individuals
(selection phase). Additionally, in the new population, some individuals are transformed (exchange
phase) using "genetic" operators, leading to the creation of new solutions. The transformations can be
represented by the mutation operator, in which new individuals are generated


Procedure of evolutionary algorithm
begin
  determine fitness function
  t ← 0
  randomly create individuals in initial population P(t)
  evaluate individuals in population P(t) using fitness function
  while (not terminate criterion) do
  begin
    t ← t + 1
    perform selection of individuals to population P(t) from P(t - 1)
    change individuals of P(t) using cross-over and mutation operators
    evaluate individuals in population P(t) using fitness function
  end
end


FIGURE 29.2 Pseudo code of evolutionary algorithm. (Adapted from Michalewicz, Z., Genetic Algorithms + Data 
Structures = Evolution Programs, Springer-Verlag, Berlin/Heidelberg, Germany, 1992.) 




by a small modification of a single individual, and by the cross-over operator, where new individuals
in the form of a single-layer chromosome are created by linking fragments of chromosomes from several
(two or more) individuals [2]. These operators are described in Section 29.2.5. After several
generations, the computation converges, and we can expect that the best individuals, representing
acceptable solutions, are located near the optimal solution.


29.2.1 Fitness Function 


The fitness function in an evolutionary algorithm is the element linking the considered problem and
the population of individuals. The main task of this function is to determine the quality of
particular individuals (solutions) with respect to the problem to be solved. In the case of problems
related to minimization of the objective function, the solutions having the smallest value of the
fitness function will be the better solutions, while in the case of a maximization problem, these
solutions will be the worst ones. In evolutionary algorithms, typically, the maximization of the
objective function is considered. Therefore, in the case of a minimization problem, we must convert
the problem to a maximization task. The simplest way to do this is to change the sign of the objective
function (multiply it by -1) and to assure positiveness of this function for all values of input
arguments. This is necessary because selection based on the roulette method, typically used in
classical genetic algorithms, requires nonnegative fitness values for each individual. The solution of
this problem is a suitable definition of the fitness function (FF) based on the objective function
(OF). The following formula can be used for minimization tasks [9]:


$$FF(x_i^t) = f_{max} - OF(x_i^t) \qquad (29.1)$$


Of course, the value f_max (the highest value of the objective function) is usually not known a
priori; therefore, the highest value of the objective function obtained during all previous
generations of the algorithm is assigned as the f_max value. Additionally, if we want to obtain
fitness function values only in the range (0; 1], then the definition of the fitness function for
minimization problems is as follows [9]:


$$FF(x_i^t) = \frac{1}{1 + OF(x_i^t) - f_{min}} \qquad (29.2)$$


where f_min is the lowest value of the objective function observed during previous iterations of the
algorithm. In the case of maximization tasks, the value of fitness of an individual is scaled as
follows [9]:


$$FF(x_i^t) = \frac{1}{1 + f_{max} - OF(x_i^t)} \qquad (29.3)$$
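The three scalings can be written as small Python functions. This is a sketch only: the function names are hypothetical, and the sample objective values are invented for illustration.

```python
def ff_shift(of_value, f_max):
    """Formula (29.1): fitness for minimization, nonnegative when f_max >= OF."""
    return f_max - of_value

def ff_minimize(of_value, f_min):
    """Formula (29.2): fitness in (0; 1] for minimization tasks."""
    return 1.0 / (1.0 + of_value - f_min)

def ff_maximize(of_value, f_max):
    """Formula (29.3): fitness in (0; 1] for maximization tasks."""
    return 1.0 / (1.0 + f_max - of_value)

of_values = [4.0, 1.5, 9.0]          # hypothetical objective function values
f_min, f_max = min(of_values), max(of_values)

print([ff_shift(v, f_max) for v in of_values])
print([ff_minimize(v, f_min) for v in of_values])
print([ff_maximize(v, f_max) for v in of_values])
```

With (29.2) the individual with the lowest objective value always receives fitness 1, and with (29.3) the individual with the highest objective value does.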


29.2.2 Representation of Individuals—Creation of Population 


Depending on the problem, the typically used representations of individuals are binary, real-number,
and integer-number. In the case of binary representation, the determination of the number of genes (NG)
required for the coding of a given variable V ∈ [x_min; x_max] with assumed precision B is very
important. In this case, the following inequality must be fulfilled:


$$2^{NG} \ge \frac{x_{max} - x_{min}}{B} + 1 \qquad (29.4)$$
In the case of coding in the form of real numbers or integer numbers, the total number of genes in an
individual is equal to the number of variables in the optimized task. During the creation of a
population consisting of M individuals having NG genes in each chromosome, the value of each gene is
randomly selected from an assumed range. In the case of binary coding, each gene has the value "0" or
"1," and in the case of real-number or integer-number coding, the value represented by a particular
gene is randomly chosen from the range [x_min; x_max], determined individually for each variable.
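A minimal Python sketch of random population creation for both codings is given below. The function names, the seed, and the sizes (M = 5, 18 binary genes, three real variables in [-5.12; 5.12]) are assumptions chosen for illustration.

```python
import random

def create_binary_population(m, genes, rng):
    """M chromosomes, each a list of `genes` random bits."""
    return [[rng.randint(0, 1) for _ in range(genes)] for _ in range(m)]

def create_real_population(m, bounds, rng):
    """M chromosomes with one real-valued gene per variable.

    bounds: list of (x_min, x_max) pairs, one pair per variable.
    """
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(m)]

rng = random.Random(0)                       # fixed seed only for repeatability
bin_pop = create_binary_population(5, 18, rng)
real_pop = create_real_population(5, [(-5.12, 5.12)] * 3, rng)

print(bin_pop[0])
print(real_pop[0])
```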




29.2.3 Evaluation of Individuals 


In the case of binary representation of individuals, the fitness value of a particular individual can
be evaluated using the defined fitness function and the phenotype values of that individual. The
phenotype can be computed from the genotype using the following formula:


$$\mathrm{phenotype} = x_{min} + \frac{x_{max} - x_{min}}{2^{NG} - 1} \cdot \mathrm{dec}(\mathrm{genotype}) \qquad (29.5)$$


where dec(·) represents the decimal value corresponding to the chosen genotype.

When the genotype contains information related to several variables (located in several parts of the
chromosome, see Figure 29.3a), the value of the phenotype is computed for each variable separately
using that part of the genotype where this variable is written down (see Figure 29.3).
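Decoding according to formula (29.5) can be sketched in Python as follows. The function name `decode` is hypothetical; the test chromosome is individual O_1 from the example in Section 29.2.7 (three variables, six genes each, range [-5.12; 5.12]).

```python
def decode(genotype, ng, x_min, x_max):
    """Split a bit string into fields of ng genes and decode each field with (29.5)."""
    step = (x_max - x_min) / (2 ** ng - 1)
    return [x_min + step * int(genotype[k:k + ng], 2)   # int(..., 2) plays the role of dec(.)
            for k in range(0, len(genotype), ng)]

phenotype = decode("001011000111110011", 6, -5.12, 5.12)
print([round(v, 2) for v in phenotype])
```

The all-zeros field decodes to x_min and the all-ones field to x_max, so the whole variable range is covered by the 2^NG code words.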

For individual representation in the form of real numbers or integer numbers, the genotype is identical
to the phenotype, and phenotype computing is not required (see Figure 29.1b). Once the phenotype of a
particular individual is known, it is possible to compute the value of the fitness function, which
determines the quality of the given individual.


29.2.4 Selection 


Selection, also named reproduction, is a procedure of choosing individuals from the population in
order to create a new population in the next generation of the evolutionary algorithm. The
probability of selection of a given individual depends on its fitness value. The higher the fitness
value of an individual, the higher its chance of being selected to the new generation. The
reproduction process is strictly connected with the two most important factors in evolutionary
algorithms: preservation of the diversity of the population and selection pressure. These factors
depend on each other, because an increase of selection pressure causes a decrease of population
diversity (and inversely) [1]. Too high a selection pressure (concentration of the search only on the
best individuals) leads to premature convergence, which is an undesirable effect in evolutionary
algorithms, because the algorithm can get stuck in a local extremum. However, too low a selection
pressure makes the search of the solution space almost random. The main goal of the selection
operators is the preservation of balance between these factors [2]. There exist many selection
methods. The oldest (and most popular) one is proportional selection, also named roulette selection.
In this method, the probability of selection of an individual is proportional to the value of its
fitness function [1]. For each individual, the sector size on the roulette wheel is equal to the
relative fitness (rfitness) of the individual, that is, the fitness value divided by the sum of all
fitness values GF (global fitness) in the population (see formulas (29.6) and (29.7)). In Figure 29.4,
an example of a roulette wheel with scaled sectors for M = 5 individuals is presented.


[Figure 29.3 panels: (a) binary genotype divided into three consecutive fields x1, x2, and x3;
(b) corresponding phenotype: x1 = 4, x2 = 13, x3 = 6]

FIGURE 29.3 Genotype with coded 3 variables: x1, x2, and x3 (a), phenotype corresponding to it (b). 
Population      Fitness   Relative fitness

Individual 1    56        0.280
Individual 2    12        0.060
Individual 3    43        0.215
Individual 4    67        0.335
Individual 5    22        0.110

Global fitness = 200

[Roulette wheel with sectors proportional to the relative fitness values]


FIGURE 29.4 Example of roulette wheel with scaled sectors for M = 5 individuals. 


To use the roulette selection, we must compute the global fitness (GF) for the whole population:

$$GF = \sum_{i=1}^{M} fitness_i \qquad (29.6)$$


where
fitness_i represents the fitness value of the ith individual in the population
M is the number of individuals in the population


Then, the value of relative fitness (rfitness) is computed for each ith individual:


$$rfitness_i = \frac{fitness_i}{GF} \qquad (29.7)$$


The value of relative fitness rfitness_i represents the probability of the selection of the ith
individual to the new population (the probability of selection is higher for individuals having a
larger roulette sector). Next, the sector ranges on the roulette wheel must be determined for
particular individuals. The roulette sector is equal to [min_i; max_i) for the ith individual. The
border values are computed as follows:


$$min_i = max_{i-1} \qquad (29.8)$$
$$max_i = min_i + rfitness_i \qquad (29.9)$$


For the first individual, the value min_1 = 0 (see Figure 29.4).

In the next step, a random value from the range [0; 1) is chosen M times in order to select M
individuals to the new population. If a randomly chosen value is inside the range [min_i; max_i), then
the ith individual is selected to the new population.
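Roulette selection according to formulas (29.6) through (29.9) can be sketched in Python as follows. The function names are hypothetical, and the fitness values 56, 12, 43, 67, 22 are an assumption chosen to match the relative fitness values 0.280, 0.060, 0.215, 0.335, 0.110 of Figure 29.4 with GF = 200.

```python
import bisect
import itertools
import random

def roulette_sectors(fitness):
    """Upper borders max_i of the sectors [min_i; max_i), with min_1 = 0."""
    gf = sum(fitness)                                   # global fitness (29.6)
    return list(itertools.accumulate(f / gf for f in fitness))

def select(upper, r):
    """Zero-based index of the individual whose sector contains r from [0; 1)."""
    # The min(...) guards against rounding when the last border is slightly below 1.
    return min(bisect.bisect_right(upper, r), len(upper) - 1)

fitness = [56, 12, 43, 67, 22]
upper = roulette_sectors(fitness)

rng = random.Random(0)                                  # fixed seed only for repeatability
new_population = [select(upper, rng.random()) for _ in fitness]
print(upper)
print(new_population)
```

Storing only the upper borders is sufficient because each min_i equals the previous max_{i-1}, so a binary search over the borders finds the sector directly.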

Besides roulette selection, there exist many other selection methods. Among those, we can mention 
rank selection [2], tournament selection [2], and fan selection [10]. 

The roulette selection method described above can be equipped with mechanisms that promote the
survival of better individuals. The best known of them is the elitist model, in which the best
individual is introduced to the next generation with the omission of the standard selection procedure
[2]. It is applied when the best individual (with the highest value of the fitness function) does not
survive in the next population. In such a case, the worst individual in the population is replaced by
the best one from the earlier population.


29.2.5 Mutation and Cross-Over 


The selection procedure is the first step in the creation of new generations in evolutionary
algorithms. However, the newly created individuals are only duplicates of individuals from the
previous generation; therefore, the application of genetic operators is necessary in order to modify
chosen individuals. In general, the operators used in evolutionary algorithms can be divided into two
groups. The first group consists of one-argument operators, that is, operators acting on a single
individual. One-argument operators are called mutation and are executed on the population of genes
with probability PM ∈ [0; 1]; this value is one of the parameters of the algorithm. Mutation is based
on the random selection of a real number rand_ij from the range [0; 1) for each gene in the population.
If the randomly chosen real number rand_ij < PM for the jth gene in the ith individual, then this gene
is mutated. Simple mutation (for binary representation of individuals) changes the value of the chosen
gene from "0" to "1," or inversely. In Figure 29.5a, the scheme of simple mutation is shown. In the
real-number representation of individuals, the procedure of mutation is analogous to that in binary
representation, but the new value of the gene is randomly chosen from the assumed range for each
variable (see Figure 29.5b).
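Both mutation variants can be sketched in Python as below. The function names and the use of `random.Random` as the source of rand_ij are illustrative assumptions.

```python
import random

def mutate_binary(individual, pm, rng):
    """Flip each gene independently when its random number rand_ij < PM."""
    return [1 - g if rng.random() < pm else g for g in individual]

def mutate_real(individual, pm, bounds, rng):
    """Redraw each gene from its own range [x_min; x_max] when rand_ij < PM."""
    return [rng.uniform(lo, hi) if rng.random() < pm else g
            for g, (lo, hi) in zip(individual, bounds)]

rng = random.Random(0)                      # fixed seed only for repeatability
print(mutate_binary([1, 0, 1, 1, 0], 0.05, rng))
print(mutate_real([22.85, 10.43, 32.15, 66.21, 11.34], 0.05, [(0.0, 100.0)] * 5, rng))
```

The range (0; 100) used for the real-valued genes is hypothetical; in practice each variable carries the bounds defined for the optimized task.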

The second group of genetic operators consists of multi-argument operators named recombination or
cross-over. In an evolutionary algorithm, the cross-over operation acts on the population of
individuals with probability PC ∈ [0; 1]; it is based on the random choice of a real number rand_i
from the range [0; 1) for each individual. When rand_i < PC for the ith individual, this individual is
chosen for the cross-over operation. In evolutionary algorithms, the simplest model of cross-over is
the simple cross-over operator, also named one-point cross-over. In Figure 29.6a, the scheme of the
one-point cross-over with crossing point equal to "K1" is graphically shown for binary representation
of individuals (in Figure 29.6b, the scheme of one-point cross-over is presented for real-number
representation of individuals). In general, two child individuals are created from two parent
individuals using the cross-over operator. However, in evolutionary algorithms, many other types of
cross-over operators are used. These operators have been created in order to provide an effective
exchange of information between two chromosomes. Usually, the type of the cross-over operator is
chosen to suit the solved problem. Examples of typical problem-dependent recombination (cross-over)
operators are PMX (partially mapped cross-over), CX (cycle cross-over), and OX (order cross-over) [4],
which are used to solve the traveling salesman problem.
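One-point cross-over and the choice of parents with probability PC can be sketched in Python as follows. The function names are hypothetical; the parent chromosomes are the real-number individuals shown in Figure 29.6b, with the crossing point assumed to lie after the third gene.

```python
import random

def choose_for_crossover(population, pc, rng):
    """Indices of individuals whose random number rand_i < PC."""
    return [i for i in range(len(population)) if rng.random() < pc]

def one_point_crossover(parent1, parent2, k):
    """Exchange the tails of two parents after crossing point k (1 <= k < length)."""
    return parent1[:k] + parent2[k:], parent2[:k] + parent1[k:]

p1 = [22.85, 10.43, 32.15, 66.21, 11.34]
p2 = [13.32, 43.21, 33.11, 12.23, 51.43]
child1, child2 = one_point_crossover(p1, p2, 3)
print(child1)
print(child2)
```

The same function works for binary chromosomes stored as lists, because it only slices and concatenates gene sequences.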


(a) Individual before mutation: 1 0 1 1 0; individual after mutation: 1 0 0 1 0
    (gene chosen to mutate: third gene)
(b) Individual before mutation: 22.85 10.43 32.15 66.21 11.34;
    individual after mutation: 22.85 38.12 32.15 66.21 11.34
    (gene chosen to mutate: second gene)


FIGURE 29.5 Scheme of simple mutation operator for individual with binary representation (a) and real-number 
representation (b). 
(a) Individuals before crossover: 1 0 1 1 0 and 1 1 1 1 1;
    individuals after crossover: 1 0 1 1 1 and 1 1 1 1 0 (crossing point K1)
(b) Individuals before crossover: 22.85 10.43 32.15 66.21 11.34 and 13.32 43.21 33.11 12.23 51.43;
    individuals after crossover: 22.85 10.43 32.15 12.23 51.43 and 13.32 43.21 33.11 66.21 11.34
    (crossing point K1)


FIGURE 29.6 Scheme of one-point cross-over operator for individuals with binary representation (a) and 
real-number representation (b). 


29.2.6 Terminate Conditions of the Algorithm 


The termination conditions most often used in evolutionary algorithms are algorithm convergence, that
is, invariability of the best solution over an assumed number of generations, or reaching the assumed
number of generations by the algorithm.


29.2.7 Example 


Minimize the objective function OF having three variables:


$$OF = f(x_1, x_2, x_3) = \sum_{i=1}^{3} x_i^2, \qquad -5.12 \le x_i \le 5.12$$

Global minimum = 0 in (x_1, x_2, x_3) = (0, 0, 0)


The parameters of the evolutionary algorithm are M = 5, PM = 0.05, PC = 0.5. For a better presentation
of the problem, two kinds of representation of individuals are considered: binary and real-number. It
is assumed that each variable must be coded with precision B = 0.2 for binary representation.
Therefore, the lowest value of NG (number of genes representing one variable) that fulfills
inequality (29.4)


$$2^{NG} \ge \frac{x_{max} - x_{min}}{B} + 1 \;\Rightarrow\; 2^{NG} \ge \frac{5.12 + 5.12}{0.2} + 1 \;\Rightarrow\; 2^{NG} \ge 52.2$$


is equal to 6; therefore, each individual will be composed of 18 genes (6 genes for each variable).
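The smallest NG satisfying inequality (29.4) can be found with a short loop. The function name `genes_needed` is a hypothetical helper, not part of the original text.

```python
def genes_needed(x_min, x_max, precision):
    """Smallest NG with 2**NG >= (x_max - x_min) / precision + 1, inequality (29.4)."""
    required = (x_max - x_min) / precision + 1
    ng = 1
    while 2 ** ng < required:
        ng += 1
    return ng

ng = genes_needed(-5.12, 5.12, 0.2)
print(ng, 3 * ng)   # genes per variable, genes per individual with three variables
```

Five genes would give only 32 code words, fewer than the roughly 52.2 required, so six genes (64 code words) is the minimum.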


29.2.7.1 Determination of Fitness Function 


In order to guarantee nonnegative values of the fitness function for all values of input arguments,
the fitness function FF is scaled according to formula (29.2) for the minimization task:


$$FF(x_i^t) = \frac{1}{1 + OF(x_i^t) - f_{min}}$$


This scaling will be performed during the process of selection of individuals. Of course, the scaling
is connected with the application of the roulette selection method in the evolutionary algorithm. In
the case of other selection methods, the above transformation is not required.
29.2.7.2 Random Creation of Population of Individuals O_i


For binary representation (each variable is represented using six genes)
O_1 = {001011000111110011}; O_2 = {000111111110001100}; O_3 = {001111100110010010};
O_4 = {010101100110101110}; O_5 = {001111001110000101}


For real-number representation (each variable is represented using one gene)
O_1 = {2.13; -0.56; -3.56}; O_2 = {4.11; 2.01; -1.96}; O_3 = {0.95; 0.43; -0.87};
O_4 = {1.12; -3.54; 1.67}; O_5 = {-0.87; 3.44; 2.55}.


29.2.7.3 Evaluation of Individuals


For binary representation
The phenotypes of individuals are computed using formula (29.5); for example, for the first variable
of individual O_1, the computational process is as follows:


$$\mathrm{phenotype} = x_{min} + \frac{x_{max} - x_{min}}{2^{NG} - 1} \cdot \mathrm{dec}(\mathrm{genotype})$$

$$\mathrm{phenotype} = -5.12 + \frac{5.12 + 5.12}{2^{6} - 1} \cdot \mathrm{dec}(001011) = -3.33$$


The phenotype values of all individuals are as follows:


O_1 = {-3.33; -3.98; 3.17}; O_2 = {-3.98; 4.96; -3.17}; O_3 = {-2.68; 1.06; -2.19};
O_4 = {-1.71; 1.06; 2.36}; O_5 = {-2.68; -2.84; -4.31};


The values of the objective function OF_i for the ith individuals are


OF_1 = (-3.33)^2 + (-3.98)^2 + (3.17)^2 = 36.98; OF_2 = 50.49; OF_3 = 13.10; OF_4 = 9.62; OF_5 = 33.82


For real-number representation
The phenotype values are equal to the genotype ones and, therefore, additional computations are not
required; the values OF_i for the ith individuals are as follows:


OF_1 = 17.52; OF_2 = 24.77; OF_3 = 1.84; OF_4 = 16.57; OF_5 = 19.09


29.2.7.4 Selection


The selection process is performed identically for binary and real-number representation of
individuals. Therefore, only the selection of individuals in real-number representation is considered.


For real-number representation
The values of the fitness function FF_i are computed for the ith individuals according to formula
(29.2). The lowest value of the objective function OF obtained in previous generations is assumed as
the value of f_min; thus f_min = OF_3 = 1.84. As a result, the following values of the fitness function
are obtained: fitness_1 = FF_1 = 0.0599, fitness_2 = FF_2 = 0.0418, fitness_3 = FF_3 = 1,
fitness_4 = FF_4 = 0.0636, and fitness_5 = FF_5 = 0.0548.
The value of the total (global) fitness of individuals computed according to formula (29.6) is equal
to GF = 1.2201. The values of relative fitness (rfitness) computed for particular individuals using
formula (29.7) are


rfitness_1 = 0.049; rfitness_2 = 0.034; rfitness_3 = 0.820; rfitness_4 = 0.052; rfitness_5 = 0.045.
We can see that individual 3 has the highest chance to survive, and individual 2 has the lowest
chance to survive. In the next step, it is possible to construct a roulette sector for each individual
using the relative fitness values (see formulas (29.8) and (29.9)). These roulette sectors are as
follows:


O_1 => [0; 0.049); O_2 => [0.049; 0.083); O_3 => [0.083; 0.903); O_4 => [0.903; 0.955); O_5 => [0.955; 1)


After the roulette wheel is scaled, five real numbers from the range [0; 1) are randomly chosen, for
example, 0.3, 0.45, 0.8, 0.96, 0.04; therefore, individuals {O_3; O_3; O_3; O_5; O_1} are selected to
the new population.

In the next step, the new selected individuals are mutated and crossed-over (see Section 29.2.5, 
and Figures 29.5 and 29.6). After these operations, the solutions (individuals) are evaluated as in 
Section 29.2.7.3, and the whole process is repeated until the termination condition of the algorithm 
is reached. 
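The whole loop for the real-number representation can be sketched in Python as below. It is an illustration under stated assumptions, not the exact procedure of the example: the termination condition is a fixed number of generations, individuals chosen for cross-over are paired in the order in which they were chosen, the random seed is arbitrary, and `random.choices` with weights is used as a stand-in for the roulette wheel.

```python
import random

BOUNDS = [(-5.12, 5.12)] * 3                      # range of each of the three variables

def of(x):
    """Objective function of the example: sum of squares (minimization)."""
    return sum(v * v for v in x)

def evolve(m=5, pm=0.05, pc=0.5, generations=100, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(m)]
    best = list(min(pop, key=of))
    f_min = of(best)                              # lowest OF value observed so far
    for _ in range(generations):                  # assumed termination: generation limit
        fitness = [1.0 / (1.0 + of(x) - f_min) for x in pop]              # (29.2)
        pop = [list(x) for x in rng.choices(pop, weights=fitness, k=m)]   # roulette
        chosen = [i for i in range(m) if rng.random() < pc]
        for a, b in zip(chosen[0::2], chosen[1::2]):                      # one-point cross-over
            k = rng.randint(1, len(BOUNDS) - 1)
            pop[a][k:], pop[b][k:] = pop[b][k:], pop[a][k:]
        for ind in pop:                                                   # mutation
            for j, (lo, hi) in enumerate(BOUNDS):
                if rng.random() < pm:
                    ind[j] = rng.uniform(lo, hi)
        candidate = min(pop, key=of)                                      # evaluation
        if of(candidate) < of(best):
            best = list(candidate)
        f_min = min(f_min, of(candidate))
    return best

solution = evolve()
print(solution, of(solution))
```

Keeping `f_min` as the lowest value seen so far guarantees that the denominator in (29.2) is at least 1, so every fitness value stays in (0; 1] as roulette selection requires.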


29.3 Conclusions 


In this chapter, fundamental information concerning evolutionary algorithms is presented.
Evolutionary algorithms are widely used in many optimization tasks. At present, many different kinds
of these algorithms have been developed for use in multimodal optimization, multi-objective
optimization, and optimization with constraints. In this chapter, only a few characteristics of
evolutionary algorithms are described, together with examples presenting the successive steps of
evolutionary computation.
30.1 Introduction


Data mining has been attracting increasing attention in recent years. Automated data collection tools 
and major sources of abundant data ranging from remote sensing, bioinformatics, scientific simulations, 
via web, e-commerce, transaction and stock data, to social networking, YouTube, and other means of 
data recording have resulted in an explosion of data and the paradox known as “drowning in data, but 
starving for knowledge.” 

The Library of Congress had collected about 70 terabytes (TBs) of data through May 2007; by February 
2009 this had increased to more than 95 TBs of data [LOC 98TB]. On July 4, 2007, the National Archives 
of Britain issued a press release announcing a memorandum of understanding (MoU) with Microsoft 
to preserve the UK’s digital heritage, an estimated archive content of “580 TBs of data, the equivalent 
of 580 thousand encyclopaedias” [NAB 580TB]. Today’s ubiquitous consumer electronics and mobile
devices reveal astonishing jumps in storage capacity. In late 2008, TiVos, iMacs, Time Capsule, and
various external hard disk drives regularly offered 1 TB of storage capacity. Music players such as iPods,
personal digital assistants (PDAs), and pocket PCs offered storage options for hundreds of megabytes of
multimedia material.

Hard disk drive technology has made a giant leap since its commercial inception in the
mid-1950s [Hoagland 03], going from 5.25 in. drives to 3.5 in. drives, reducing the number of platters
and heads while at the same time increasing the areal density (amount of data per square inch
of media). In September 2006, the major innovator in disk drive technology, Hitachi Global Storage
Technologies (Hitachi GST), based at the San Jose Research Center in San Jose, CA (previously IBM
Storage Technology Division), demonstrated a recording density of 345 Gbits per square inch, based on
the known but previously abandoned perpendicular recording technology (replacing the conventional
longitudinal recording) [Hitachi PR,Hoagland 03]. Hitachi predicted that by 2009 the new technology
would result in a two-terabyte (TB) 3.5-in. desktop drive, a 400-gigabyte (GB) 2.5-in. notebook drive,
or a 200-GB 1.8-in. drive (such as the one used in Apple iPods) [Hitachi PR,Hitachi SC]. These quickly
growing and widely available massive data sets have imposed the next logical necessity: the need for
automated analysis of these massive data sets.


30.2 What Is Data Mining? 


Data mining emerged as a natural evolution of database (DB) technology [Han 06]. Primitive file
processing systems and database creation in the 1960s quickly evolved into powerful database systems in
the 1970s, including hierarchical and relational databases, SQL query languages, and high-speed
transaction methods called on-line transaction processing (OLTP). The mid-1980s saw the introduction
of advanced databases, leading to data warehouses (repositories of multiple heterogeneous data sources
at a single site, facilitating decision making) and on-line analytical processing (OLAP) techniques capable
of functions such as summarization, consolidation, and aggregation. Spatial, temporal,
multimedia, web, and text mining became available with more in-depth analysis and sophisticated
techniques of machine learning, pattern, time-series, and social data mining (Figure 30.1). As the
growth of data continues, commercial tools are continually updated and aimed at solving diverse
problems ranging from marketing, sales, and business intelligence to counterterrorism and social
networking [Thuraisingham 2003,Berry 04,Linoff 2002,Matignon 07].

Although different definitions of data mining exist in the literature, the simplest designation that puts all
of the above together is that data mining is an intelligent process of mining knowledge from large amounts
of data. Despite referring to mining knowledge from massive amounts of data, “data mining” is
a misleading designation: we are really mining knowledge from data (just as mining gold from rocks
is referred to as gold mining, not rock mining). Probably the most correct name is “knowledge
mining from data.” Because of its length, it never gained popularity; the same applies to expressions
such as knowledge extraction, data archeology, and data dredging. Perhaps the only other terms that have
gained popularity similar to data mining are knowledge discovery from data (KDD), data to knowledge
(D2K), knowledge and data engineering, and the combination of data mining and knowledge discovery
[FD2K fi, IBM kdd,UCI D2K,ELSE DKE,IEEE TKDE,Springer DMKD,Babovic 01,Babovic 99].

FIGURE 30.1 Data mining as fusion of disciplines.

FIGURE 30.2 Steps of knowledge discovery process.

The typical knowledge discovery process (Figure 30.2) can be described by three main
phases: data integration (DI), OLAP, and front-end knowledge presentation tools. The
first phase, data integration, entails data preprocessing such as data cleaning, integration, selection,
and transformation. While data cleaning pertains to the removal of noisy, inconsistent data, data
integration refers to the merging of disparate data sources (from Oracle to flat ASCII). Data selection
relates to the retrieval of task-applicable data, and transformation involves data conversion into a format
adequate for the next step, data mining. This phase is popularly known as ETL (extraction,
transformation, loading), and results in the creation of data marts (DM) and data warehouses (DW).
Data warehouses are large data repositories composed of data marts. The final phase consists of
front-end tools, often referred to as business intelligence (BI) portal type tools.

Data mining represents tasks ranging from association rules and regression analysis to various
intelligent and machine learning techniques such as neuro-fuzzy systems, support vector
machines, and Bayesian techniques. Through further pattern evaluation, the mined data instances
that represent knowledge are selected and then visualized in the final, knowledge presentation
phase. While data mining is clearly only one phase in the entire process of knowledge discovery, the
data mining designation has been widely accepted as a synonym for the whole data-to-knowledge
process [Han 06].

Throughout all of the phases of the knowledge discovery process, the constant reference to metadata 
is maintained. Metadata contains the data about the data such as data names, definitions, time stamp- 
ing, missing fields, and structure of the data warehouses (schemas, dimensions, hierarchies, data defini- 
tions, mart locations). 


30.3 OLAP versus OLTP 


Because operational databases are targeted at different uses and applications, performance
considerations dictate that they be kept separate from data warehouses. Transaction-oriented OLTP systems
are typically responsible for known operations, such as daily searching for particular records, and are
optimized for high performance and availability of flat relational or entity-relationship (ER) types of
transactions. These are online transaction and query processing operations, traditionally used for
inventory or payroll (accounting type of processing).

FIGURE 30.3 OLTP versus OLAP systems.

Unlike the operational OLTP database systems, data warehouses provide OLAP for interactive
analysis, at various granularities, of multidimensional data stored in n-dimensional data cubes. OLAP
systems provide higher flexibility in data analysis and decision support capabilities that are more complex
than the operational processing offered by OLTP systems. The comparison between the two systems is given
in Figure 30.3.


30.3.1 Data Cubes 


With data warehouses and OLAP data analysis, the emphasis is put on multidimensional data cubes
(DCs). DCs are multidimensional entities that offer a means of data modeling and processing. Data
cubes are defined by dimension tables and fact tables. Dimension tables contain attributes, that is, the
entities used to organize and group data. Fact tables contain numerical measures (quantities) of the
dimensions in the dimension tables. For example, Company1 would like to create a sales data warehouse
with the following dimensions: time, item, and state. The fact table contains the measures (facts) of the
sales data warehouse, as well as keys to each of the related dimension tables. For example, measures
(facts) of the sales data warehouse could be $SaleAmount and SoldUnits. A 3D table and a 3D DC
representation are given in Figures 30.4 and 30.5.

FIGURE 30.4 3D view of web based sales.

FIGURE 30.5 3D data cube web based sales.


Naturally, an n-dimensional DC can be composed of a series of (n - 1)-dimensional DCs:

$$DC^{n} = \text{series of } DC^{n-1} \tag{30.1}$$


or, as illustrated by Figure 30.6: 

In a 4D cube, each possible subset of the four dimensions creates a DC called a cuboid. All possible
cuboids form a lattice of cuboids, each representing a different level of summarization, or “group by” SQL
command (Figure 30.7).

The cuboid at the top of the lattice (0D cuboid, all) is the apex cuboid, i.e., summarization over all
four dimensions (highest summarization level). The cuboid at the bottom of the lattice (Figure 30.7)
is the base cuboid, i.e., a 4D cuboid for the four given dimensions (lowest summarization level).
Figure 30.8 illustrates various aggregations by quarters, parts, states, and finally by all.
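
The lattice of cuboids can be enumerated directly, since every subset of the dimensions corresponds to one "group by". The sketch below is illustrative only: the four dimension names, the tiny fact list, and the helper `group_by` are assumptions made for this example, not taken from Figure 30.7.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical names for the four dimensions of a 4D cube.
dimensions = ("time", "item", "state", "supplier")

# Every subset of the dimensions defines one cuboid: 2^4 = 16 in total,
# from the apex cuboid () up to the 4D base cuboid.
lattice = [subset
           for k in range(len(dimensions) + 1)
           for subset in combinations(dimensions, k)]

def group_by(facts, dims):
    """Sum the 'units' measure over all facts sharing the same values on dims."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["units"]
    return dict(totals)

# Made-up fact rows for illustration.
facts = [
    {"time": "Q1", "item": "auto", "state": "ID", "supplier": "S1", "units": 230},
    {"time": "Q2", "item": "auto", "state": "ID", "supplier": "S1", "units": 220},
    {"time": "Q1", "item": "racing", "state": "WY", "supplier": "S2", "units": 100},
]

apex = group_by(facts, ())              # summarization over all dimensions
by_state = group_by(facts, ("state",))  # one of the four 1D cuboids
print(len(lattice), apex, by_state)
```

Grouping by the empty tuple yields a single total (the apex cuboid), while grouping by all four dimensions reproduces the base cuboid.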


30.3.2 OLAP Techniques on Data Cubes 


Data warehouses (DWs) represent repositories of heterogeneous, disparate data sources (for instance, MS
Access, Excel, SQL, MySQL), stored at a single site under a unified schema. Data warehouses result from
data preprocessing tasks (data cleaning, integration, selection, and transformation) with the addition
of data loading and refreshing. The physical structure of DWs typically comprises either relational
DBs or multidimensional DCs. DCs provide a multidimensional view of stored data and enable various OLAP
techniques for data analysis, such as drill-down, roll-up (drill-up), slice and dice, pivot (rotate), and
others (drill-across, drill-through, ranking the top/bottom N items, computing moving averages, growth
rates, depreciation, etc.).

Consider the example of web sales of different items (auto, motorcycle, farm equipment, and
racing equipment parts). Imagine the web sales sites are located in the following states: Idaho, Wyoming,
Montana, and Utah. Figure 30.9 illustrates this scenario, where web sales of auto parts by quarters
recorded by the Idaho server were 230, 220, 314, and 297. This scenario will be used later to clarify basic
OLAP techniques.

FIGURE 30.6 4D data cube composed of 3D data cubes.

FIGURE 30.7 Lattice of cuboids, based on the 4D data in previous figure.


The roll-up (or drill-up) technique maps low-level concepts to high-level ones (for example,
aggregation or reduction in dimensions from quarters to years). The opposite technique, roll-down (or
drill-down), expands a dimension (for example, from quarters to months). Dicing extracts
sub-cubes of certain dimensions from the original data cube. Dicing is illustrated in
Figure 30.9, upper right corner (time = Q2 or Q3, location = Idaho or Wyoming, and item = Auto Parts
or Racing Parts). Slicing represents a selection on one dimension, while pivoting simply
rotates the existing dimensions, here states and items, as also illustrated in Figure 30.9 (lower right corner).
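
As a rough illustration of these operations, the sketch below stores a small cube as a Python dictionary keyed by (quarter, state, item). The Idaho auto-parts figures are those quoted in the text; every other cell, and all function names, are assumptions made for the example. Roll-up here sums the four quarters into a single yearly total.

```python
from collections import defaultdict

# (quarter, state, item) -> units sold. Idaho/Auto values come from the text;
# the remaining cells are made-up illustrative values.
cube = {
    ("Q1", "Idaho", "Auto"): 230, ("Q2", "Idaho", "Auto"): 220,
    ("Q3", "Idaho", "Auto"): 314, ("Q4", "Idaho", "Auto"): 297,
    ("Q2", "Wyoming", "Auto"): 150, ("Q3", "Wyoming", "Racing"): 80,
    ("Q3", "Utah", "Racing"): 60,
}

def roll_up(cube, axis):
    """Aggregate one dimension away (e.g., quarters into a yearly total)."""
    out = defaultdict(int)
    for key, units in cube.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

def slice_(cube, axis, value):
    """Select the cells with a single fixed value on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, allowed):
    """Extract the sub-cube whose coordinates lie in the allowed sets per axis."""
    return {k: v for k, v in cube.items()
            if all(k[a] in vals for a, vals in allowed.items())}

def pivot(cube, order):
    """Rotate the cube by reordering the dimensions of every cell key."""
    return {tuple(k[a] for a in order): v for k, v in cube.items()}

yearly = roll_up(cube, 0)
idaho = slice_(cube, 1, "Idaho")
diced = dice(cube, {0: {"Q2", "Q3"}, 1: {"Idaho", "Wyoming"}, 2: {"Auto", "Racing"}})
pivoted = pivot(cube, (0, 2, 1))  # swap the state and item dimensions
print(yearly[("Idaho", "Auto")])  # 1061
```

A dictionary of cells is the simplest possible representation; production OLAP engines use precomputed cuboids and indexes instead.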


30.4 Data Repositories, Data Mining Tasks, and Data Mining Patterns


30.4.1 Data Repositories 


Data mining in a general sense can be applied to various kinds of data. In the temporal sense, the data
can be static or transient (data streams). Thus, the data repositories used vary from flat files and
transactional databases to relational databases, data marts, and data warehouses.

Transactional databases typically consist of flat files in a table format (one record, one transaction).
Relational databases (managed by database management systems, or DBMSs) are comprised of uniquely
named tables with columns (attributes, or fields) and rows (records, or tuples). Each tuple is identified
by a unique key described by a set of attributes. The semantic, entity-relationship (ER) data model then
describes the database via a set of entities and their relationships.


30.4.2 Data Mining Tasks 


In principle, data mining tasks can be categorized as descriptive (characterizing the general
properties of DB data) and predictive (specific inference on data to enable predictions). Data, on the
other hand, can be associated with classes (concepts), i.e., class/concept descriptions. These
descriptions can be derived via data characterization (summarizing the general characteristics of the
class of data) or data discrimination (comparison of one class with other, contrasting classes).
Summarization and characterization can be done using various techniques. Some of the known
summarization techniques are the OLAP roll-up operation, which achieves data summarization along a
specific dimension (Figure 30.9), statistical measures and plots, and attribute-oriented induction. Data
characterization techniques, on the other hand, can entail pie charts, bar charts, n-dimensional DCs,
multidimensional tables, and crosstabs.

FIGURE 30.8 Sample data cube (above) and various aggregations of it.

An example of data characterization can be the task of summarizing the demographics of customers
who spend about $25,000 on a new car every 3 years. This characterization may result in a profile
of employed individuals who are 35-45 years old and who have credit ratings in a preferable range. The
system should allow a drill-down operation to obtain, for example, a breakdown by sex, education level,
or occupation type.

An example of data discrimination could be the comparison of a specific profile of convertible car
customers in one state, say Florida, against a contrasting class of customers of the same type of vehicle
from another state, say Idaho. A further example can be discriminating the descriptions of the same
profile of car customers today and 10 years ago.

FIGURE 30.9 Basic OLAP techniques: roll-up, roll-down, dice, slice, and pivot.


30.4.3 Data Mining Patterns 


Frequent pattern recognition in data mining concerns the identification of associations,
correlations, and classifications, leading to description and prediction. For example, people who
often deal with graphics or video processing applications may opt for a certain type of Macintosh
computer. Frequently, this purchase would be followed by an additional backup device such as
Time Capsule. Younger, gaming-oriented clientele will frequently follow a computer purchase with
various gaming devices (joystick or gaming keyboard). These events are known as frequent sequential
patterns. They can result in rules such as “80% of web designer professionals within a certain
age and income group will buy Macintosh computers.” In multidimensional databases, the previous
statement would be recognized as a multidimensional association rule (dimensions here refer to
“age,” “income,” and “buy,” for example, with web designer professionals).


30.5 Data Mining Techniques 


Numerous data mining techniques exist, ranging from rule-based, Bayesian belief networks, support 
vector machines; to artificial neural networks, k-nearest neighbor classifiers; to mining time series, 
graph mining, social networking, and multimedia (text, audio, video) data mining. A few select tech- 
niques will be addressed here. (For more details, please refer to [Han 06,Berry 04,Thuraisingham 
2003,Matignon 07,Witten 2005,Witten 2000].) 


30.5.1 Regression Analysis 


The term “regression” was introduced in the nineteenth century by Sir Francis Galton, a cousin of
Charles Darwin, and had a purely biological connotation. It was known as regression toward the mean,
where the offspring of exceptional individuals tend on average to be less exceptional than their parents
and closer to their more distant ancestors.

Regression analysis models the predictor-response relationship between independent variables
(known values, predictors) and dependent variables (responses, values to predict). By the use of
regression, curve fitting can be done as generalized linear, Poisson regression, log-linear, regression
trees, least squares, spline, or fractal.

Regression can be linear (curve fitting to a line) or nonlinear (in closed form or iterative, such as
steepest descent, Newton method, or Levenberg-Marquardt). Also, regression can be parametric, where
the regression function is defined by unknown parameters (LSM, least squares method), or
nonparametric (functions such as polynomial regression). Fuzzy regression (fuzzy linear least squares
regression) can be used to address the phenomenon of data uncertainty driving the solution uncertainty
(nonparametric regression) [Roychowdhury 98]. For example, trying to perform linear regression over the
function y_i:


$$y_i = w_1 x_i + w_0, \quad i = 1, \ldots, n \tag{30.2}$$


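can be reduced to finding w_1 and w_0 by minimizing the sum of the squared residuals (SSR), that is, by
setting the derivatives of SSR with regard to both variables w_1 and w_0 to zero:

$$\mathrm{SSR} = \sum_{i=1}^{n} \bigl(y_i - (w_1 x_i + w_0)\bigr)^2, \qquad \frac{\partial\,\mathrm{SSR}}{\partial w_0} = 0, \quad \frac{\partial\,\mathrm{SSR}}{\partial w_1} = 0 \tag{30.3}$$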
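
Setting both derivatives to zero gives the familiar closed-form solution: w_1 is the ratio of the x-y covariance sum to the x variance sum, and w_0 follows from the means. The sketch below illustrates this; the function names and the four data points are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1 * x + w0; returns (w1, w0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Solving dSSR/dw0 = 0 and dSSR/dw1 = 0 yields these two expressions.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w1, w0

def ssr(xs, ys, w1, w0):
    """Sum of squared residuals for the line y = w1 * x + w0."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))

# Made-up sample points lying close to the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]
w1, w0 = fit_line(xs, ys)
print(w1, w0, ssr(xs, ys, w1, w0))
```

Any other choice of w_1 and w_0 gives a larger SSR on the same data, which is a quick way to sanity-check the fit.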


30.5.2 Decision Trees 


Classification rules can easily substitute for decision trees (if buyer = WebDesigner => computer = Macintosh)
[Han 06,Witten 2005,Witten 2000]. These rules may have exceptions (if buyer = WebDesigner except if
buyerPreference = Windows => computer = Macintosh). Association rules, in addition, can demonstrate
how strong an association exists between frequently occurring attribute-value pairs, or items, based on
frequent itemset mining. For example:

if buyer = WebDesigner and buyer = GraphicsUser => computer = Macintosh [support = 70%,
confidence = 95%]

Here, support (or coverage) of 70% means that, out of the customers under study, 70% are web designers
and graphics users who purchased Macintosh computers, i.e., instances that the rule predicts correctly.
Accuracy (or confidence) of 95% is the proportion of correct predictions among the instances to which
the rule applies, i.e., the probability that a customer of this type will purchase a Macintosh computer.
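
Support and confidence can be computed by simple counting. The sketch below uses the usual itemset definitions (support: fraction of all records matching both sides of the rule; confidence: fraction of the records matching the "if" part that also match the "then" part). The field names and the ten records are hypothetical, so the resulting percentages differ from the 70%/95% quoted above.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    def matches(record, conditions):
        return all(record.get(k) == v for k, v in conditions.items())

    applies = [r for r in records if matches(r, antecedent)]
    correct = [r for r in applies if matches(r, consequent)]
    support = len(correct) / len(records)
    confidence = len(correct) / len(applies) if applies else 0.0
    return support, confidence

# Hypothetical customer records (field names are assumptions for this sketch).
records = (
    [{"profession": "WebDesigner", "graphics_user": True, "computer": "Macintosh"}] * 4
    + [{"profession": "WebDesigner", "graphics_user": True, "computer": "Windows PC"}]
    + [{"profession": "Accountant", "graphics_user": False, "computer": "Windows PC"}] * 5
)

support, confidence = support_confidence(
    records,
    antecedent={"profession": "WebDesigner", "graphics_user": True},
    consequent={"computer": "Macintosh"},
)
print(support, confidence)  # 0.4 0.8
```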


30.5.3 Neural Networks 


Artificial neural networks (ANNs) can be used as universal approximators and classifiers. Artificial
neurons represent summation and threshold elements. When the weighted (w_i) sum of the neuron inputs
(x_i) exceeds the defined threshold value, the neuron produces an output, i.e., it “fires”:

$$net = \sum_{i=1}^{n} w_i x_i + w_{n+1}, \qquad out = \begin{cases} 1 & \text{if } net \ge 0 \\ 0 & \text{if } net < 0 \end{cases} \tag{30.4}$$

The weight (w_{n+1}) with default input +1 is called the bias, and can be understood as the threshold (T), but
with the opposite sign (Figure 30.10).

Typically used threshold functions are bipolar sigmoidal threshold functions, as described in the
following equation:

$$o_{bip} = f_{bip}(k \cdot net) = \tanh(k \cdot net) = \frac{2}{1 + \exp(-2 \cdot k \cdot net)} - 1 \tag{30.5}$$


The graphical representation of a single neuron operation can be described easily via analytic geometry.
Thus, a single, two-input neuron, as illustrated by Figure 30.11, represents a linear classifier where the
neuron definition is 1x + 3y - 3 > 0. The weights used for inputs x and y are 1 and 3, respectively. The
neuron divides the xOy space into two areas by selecting the upper one. This neuron correctly classifies
the rectangular pattern, producing the output +1, identical to the desired output, +1 (point (2, 1)).
Further, by deselecting the lower part of the xOy space, the neuron produces -1 on the output, again
matching the desired output for the pattern (0.5, 0.5).
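
The two-input neuron described above is small enough to check numerically. The following sketch uses the weights 1 and 3 and the bias -3 from the text; the function names are hypothetical, and the hard threshold is written in bipolar form (+1/-1) to match the outputs quoted for Figure 30.11.

```python
import math

def net_value(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias weight (whose input is +1)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def hard_bipolar(net):
    """Hard threshold with bipolar outputs: +1 if net >= 0, otherwise -1."""
    return 1 if net >= 0 else -1

def soft_bipolar(net, k=1.0):
    """Bipolar sigmoidal activation, equal to tanh(k * net)."""
    return 2.0 / (1.0 + math.exp(-2.0 * k * net)) - 1.0

weights, bias = (1.0, 3.0), -3.0  # neuron 1x + 3y - 3 from the text

out_a = hard_bipolar(net_value((2.0, 1.0), weights, bias))  # net = 2
out_b = hard_bipolar(net_value((0.5, 0.5), weights, bias))  # net = -1
print(out_a, out_b)  # 1 -1
```

Replacing `hard_bipolar` with `soft_bipolar` gives a differentiable output, which is what gradient-based training algorithms require.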

One of the most used ANN algorithms is error back propagation (EBP), proposed by Werbos in
1974 and Rumelhart in 1986 [Werbos 94,Rumelhart 86]. Other popular algorithms include modifications
of EBP (Quickprop, RPROP, Delta-Bar-Delta, Back Percolation), Levenberg-Marquardt, adaptive
resonance theory (ART), counterpropagation networks (CPN), and cascade correlation networks.


FIGURE 30.10 (a) Typically used threshold functions: unipolar and bipolar sigmoidal and (b) artificial neuron
as weighted threshold element.

FIGURE 30.11 Graphical representation of single neuron operation.


30.6 Multidimensional Database Schemas 


Operational databases are based on entity-relationship (E-R) data models or schemas, which describe the
set of entities and the relationships among them. Data warehouses use subject-oriented schemas, which
are more suitable for OLAP. Three typical OLAP schemas are star, snowflake, and fact constellation.
A star schema is composed of the fact (central) table with keys to (four) dimension tables
(redundancies are possible). A snowflake schema is composed of a centralized fact table connected to
dimensions, some of which are normalized into further tables (no redundancies), hence resembling a
snowflake (E-R relationship). The snowflake schema is easier to maintain but less effective in browsing.
A collection of stars produces a fact constellation schema (two fact tables), in which dimension tables are
shared among the fact tables.
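
A star schema can be sketched with Python's built-in `sqlite3` module. The example below is a minimal illustration, not a recommended warehouse design: the table and column names are assumptions, only three dimension tables are used (time, item, and state, as in Section 30.3.1), and the sale amounts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to group the facts.
CREATE TABLE dim_time  (time_id  INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_item  (item_id  INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state_name TEXT);

-- Central fact table: measures plus one key per dimension table.
CREATE TABLE fact_sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    item_id     INTEGER REFERENCES dim_item(item_id),
    state_id    INTEGER REFERENCES dim_state(state_id),
    sold_units  INTEGER,
    sale_amount REAL
);

INSERT INTO dim_time  VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item  VALUES (1, 'Auto Parts');
INSERT INTO dim_state VALUES (1, 'Idaho');
INSERT INTO fact_sales VALUES (1, 1, 1, 230, 4600.0), (2, 1, 1, 220, 4400.0);
""")

# Roll up over time: total units per state and item.
rows = conn.execute("""
    SELECT s.state_name, i.item_name, SUM(f.sold_units)
    FROM fact_sales f
    JOIN dim_state s ON s.state_id = f.state_id
    JOIN dim_item  i ON i.item_id  = f.item_id
    GROUP BY s.state_name, i.item_name
""").fetchall()
print(rows)  # [('Idaho', 'Auto Parts', 450)]
```

Normalizing a dimension table further (for example, splitting `dim_state` into state and region tables) would turn this star into a snowflake schema.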


30.7 Mining Multimedia Data 


Data mining of text, web, and other multimedia type data has experienced constant growth in the recent 
decade. The applications vary from research, scientific, social networking, to governmental, business, 
and marketing. 

One of the famous algorithms developed specifically for the World Wide Web, but also generally 
applicable to search and structural analysis is the hyperlink-induced topic search (HITS) algorithm. The 
HITS algorithm was introduced by Jon Kleinberg from Cornell University while a visiting scientist in 
the CLEVER project at IBM’s Almaden Research Laboratory [Kleinberg 98]. 

The HITS algorithm is also known as hubs and authorities and represents the distillation of broad
search topics via authorities and hubs joined in the link structure. User queries present various
obstacles. For example, specific queries are often associated with a scarcity problem (answers are hard
to find), while broad topic queries are often burdened by an abundance problem, resulting in too many
hits for humans to digest. The abundance problem motivates a “filter” based on the notion of
authoritative pages. However, one may quickly realize that purely endogenous measures can be hard to
establish. Kleinberg mentions several examples: the term “Harvard” is not necessarily a word often used
at www.harvard.edu; the term “search engines” may not necessarily be used on many of the natural
authorities (Yahoo, AltaVista); and Honda or Toyota may not use the term “auto manufacturer” on their
pages [Kleinberg 98].

The solution Kleinberg pointed out lies in the analysis of the link structure, where the creator of page
p, by linking to page q, confers authority on q. Collective endorsements of this kind solve the problem of
pages that are not self-descriptive. Hubs, pages that link to many authorities, take part in a mutually
reinforcing relationship that facilitates automated discovery. Mutual reinforcement means that a good hub
is a page that points to many good authorities, and a good authority is a page that is pointed to
by many good hubs. This approach is not without risks (paid endorsements, commercial competitors
deliberately omitting words).

The HITS algorithm computes an eigenvector of matrices derived from the adjacency matrix of the link
graph (weighted links among authorities and hubs). In this way, the algorithm produces a list of hubs and
authorities with large weights, regardless of the weights initialization.
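
The mutual reinforcement between hubs and authorities can be sketched as a simple iteration: authority scores are summed from the hubs pointing at a page, hub scores are summed from the authorities a page points to, and both are normalized after each round. The code below is an illustrative sketch, not Kleinberg's reference implementation; the function name, the fixed iteration count, and the toy link graph are assumptions.

```python
import math

def hits(links, iterations=50):
    """Iterate hub/authority updates on a link graph {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {q: sum(hub[p] for p in pages if q in links.get(p, ())) for q in pages}
        # A good hub points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph: three hub pages linking to two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True)[:2])  # ['a1', 'a2']
```

In this toy graph, page a1 ends up with the highest authority score because all three hubs endorse it, and h3 is the weakest hub because it points to only one authority.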


30.8 Accuracy Estimation and Improvement Techniques 


Regardless of the specific data mining task, it is beneficial to estimate the accuracy of the process exe- 
cuted and also be able to improve the accuracy achieved. Although these statistical analysis techniques 
are presented here in light of data mining tasks, they can also be applied for general accuracy estimation 
and improvement of training-testing methods such as classification and prediction. 


30.8.1 Accuracy Measures 


The holdout method partitions a data set into two sets, two-thirds for training and one-third for testing,
where testing is executed after the training process of a classifier has been completed. Random sampling
consists of k repetitions of the holdout method, with the average being the overall accuracy. K-fold
cross-validation splits the initial data set into k subsets (folds) of nearly equal size, with the training
and testing being executed k times (i.e., each fold is used for testing a system trained on the remaining
k - 1 folds). Typically, the 10-fold approach is used. “Bootstrap,” introduced by Efron in 1979, is a
concept similar to pulling yourself up by your own bootstraps; it is a resampling technique that selects
training patterns uniformly with replacement (each pattern has the same probability of being selected in
each draw, and the selected patterns comprise a virtual training population) [Efron 79]. Every resample
consists of the same number of observations; bootstrap then can model the impact of the actual sample
size [Fan 96]. One of the most popular approaches is the 0.632 bootstrap method in which, as it turns
out, 63.2% of the original data set of d tuples will end up in the bootstrap sample, with the remaining
36.8% forming the test data set [Efron 97]. This figure follows from the 1/d probability of a tuple being
selected in each draw from a data set of d tuples, so that the probability that a given tuple is never
selected in d draws approaches:


$$\lim_{d \to \infty} \left(1 - \frac{1}{d}\right)^{d} = e^{-1} \approx 0.368 \tag{30.6}$$

with the accuracy estimated on both training and test sets:

$$Accuracy(M) = \sum_{i=1}^{k} \bigl(0.632 \cdot Accuracy(M_i)_{test\_set} + 0.368 \cdot Accuracy(M_i)_{train\_set}\bigr) \tag{30.7}$$
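
The partitioning schemes above can be sketched with the standard library alone. The code below is illustrative: the function names are hypothetical, the folds are formed by striding over indices (so the data are assumed to be shuffled beforehand), and `accuracy_632` computes a single term of the sum in Equation 30.7.

```python
import math
import random

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal folds; yield (train, test) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def bootstrap_split(d, rng):
    """Draw d indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test

def accuracy_632(test_acc, train_acc):
    """One term of the 0.632 bootstrap accuracy estimate."""
    return 0.632 * test_acc + 0.368 * train_acc

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
train, test = bootstrap_split(10_000, rng)
oob_fraction = len(test) / 10_000      # close to e^-1, i.e., about 0.368
print(round(oob_fraction, 3), round(math.exp(-1), 3))
```

For large d, the fraction of tuples left out of the bootstrap sample settles near 0.368, which is where the 0.632/0.368 weighting comes from.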


30.8.2 Accuracy Improvement 


Once the accuracy of the data mining task at hand is estimated, techniques for accuracy improvement 
can be applied. Commonly used techniques are known as bagging and boosting. 

Bagging, known as bootstrap aggregation, is based on a number of bootstrap samples, with patterns
sampled with replacement. Each model trained on a bootstrap sample returns its prediction (vote) when
presented with a previously unseen pattern. The bagged algorithm (a classifier, for example) is based on
the majority of votes. The bagged classifier is therefore comprised of n classifiers, each trained on one
bootstrap sample. The advantages of such classifiers are increased accuracy over a single model based on
the initial set of patterns and robustness to noisy data.

Boosting assigns weights relative to each “pattern’s difficulty.” The higher the weight, the
more difficult it is to train on the pattern. The weights are assigned after each of the n models is trained.
The final model combines the votes of each of the n models, with the weight of each vote being a function
of that model’s accuracy.

One of the most often used boosting algorithms is AdaBoost, introduced by Freund and Schapire
in 1995 [Freund 97]. AdaBoost solved many of the practical difficulties of the earlier boosting
algorithms [Freund 99].
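
The bagging procedure described above can be sketched with a deliberately simple base classifier. In the example below, each model is a one-dimensional threshold "stump" trained on its own bootstrap sample, and the bagged prediction is the majority vote. The stump learner, the function names, and the eight data points are assumptions for illustration; any classifier could take the stump's place.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D threshold classifier (predict 1 if x >= threshold, else 0)."""
    best = None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagging(data, n_models, rng):
    """Train n_models stumps, each on a bootstrap sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def bagged_predict(models, x):
    """Return the majority vote of all stumps for input x."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Made-up 1-D training patterns: class 0 below 0.5, class 1 above.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagging(data, n_models=11, rng=random.Random(1))
print(bagged_predict(models, 0.05), bagged_predict(models, 0.95))  # 0 1
```

An odd number of models avoids ties in a two-class vote. Boosting would differ by reweighting the patterns after each model and weighting the votes by model accuracy.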


30.9 Summary 


Although already often recognized as an umbrella for a vast array of data analysis techniques, data mining
will undoubtedly gain even further popularity with the perpetual growth of data, coupled with advances in
data storage technology. While data mining has undergone tremendous changes over the years, from early
file processing systems, via hierarchical and relational databases, from customer (transaction-oriented)
OLTP, via marketing (analysis-oriented) OLAP, to multidimensional/hybrid MOLAP/HOLAP processing
techniques, the future trends in data mining seem to depend on the development of sophisticated
computational intelligence techniques for cluster analysis, trending, and prediction
of various types of data (multimedia, web, stream, sequence, time-series, social networking, and others).


Artificial neural networks perform signal processing and they learn. However, they cannot autonomously
learn and develop like a brain. Autonomous mental development models all or part of the brain
and how a system develops autonomously through interactions with its environment. The most
fundamental difference between traditional machine learning and autonomous mental development is that
a developmental program is task nonspecific, so that it can autonomously generate internal
representations for a wide variety of simple to complex tasks. This chapter first discusses why autonomous
development is necessary, based on a concept called task muddiness. No traditional methods can perform
muddy tasks. If the electronic system that you design is meant to perform a muddy task, you need to
enable it to develop its own mind. Then some basic concepts of autonomous development are explained,
including the paradigm for autonomous development, mental architectures, developmental algorithm,
a refined classification of types of machine learning, spatial complexity, and time complexity. Finally,
the architecture of a spatiotemporal machine that is capable of autonomous development is described.


31.1 Biological Development 


A human being starts to develop from the time of conception. At that time, a single cell called a zygote is
formed. In biology, the term genotype refers to all or part of the genetic constitution of an organism. The
term phenotype refers to all or part of the visible properties of an organism that are produced through
the interaction between the genotype and the environment. In the zygote, the entire genetic constitution
is called the genome, which mostly resides in the nucleus of the cell. At the conception of a new human
life, a biological program called the developmental program starts to run. The code of this program is the
genome, but this program needs the entire cell as well as the cell’s environment to run properly.

The biological developmental program handles two types of development, body development and
mental development. The former is the development of everything in the body excluding the brain.
The latter is the development of the brain (or the central nervous system, CNS). Through body
development, a normal child grows in size and weight, along with many other physical changes. Through
mental development, a normal child develops a series of mental capabilities through interactions
with the environment. Mental capabilities refer to all known brain capabilities, which include, but are not
limited to, perceptual, cognitive, behavioral, and motivational capabilities. In this chapter, the term
development refers to mental development unless stated otherwise. Biological mental development
takes place concurrently with body development, and the two are closely related. For example, if the
eyes are not normally developed, the development of the visual capabilities is greatly affected. In the
development of an artificial agent, the body can be designed and fixed (not autonomously developed),
which helps to reduce the complexity of autonomous mental development.

The genomic equivalence principle [36] is a very important biological concept for us to understand 
how biological development is regulated. This principle states that the set of genes in the nucleus of every 
cell (not only that in the zygote) is functionally complete—sufficient to regulate the development from 
a single cell into an entire adult life. This principle is dramatically demonstrated by cloning. This means 
that there are no genes that are devoted to more than one cell as a whole. Therefore, development guided 
by the genome is cell-centered. Carrying a complete set of genes and acting as an autonomous machine, 
every cell must handle its own learning while interacting with its external environment (e.g., other 
cells). Inside the brain, every neuron develops and learns in place. It does not need any dedicated learner 
outside the neuron. For example, it does not need an extracellular learner to compute the covariance 
matrix (or any other moment matrix or partial derivatives) of its input lines and store it extracellularly. If
an artificial developmental program develops every artificial neuron based on only information that is 
available to the neuron itself (e.g., the cellular environment such as presynaptic activities, the develop- 
mental program inside the cell, and other information that can be biologically stored intracellularly), we 
call this type of learning in-place learning. 

This in-place concept is more restrictive than a common concept called “local learning.” For example, 
a local learning algorithm may require the computation of the covariance matrix of the presynaptic vector,
which must be stored extracellularly. In electronics, the in-place learning principle can greatly reduce
the required electronics and storage space, in addition to offering biological plausibility. For example,
suppose that every biological neuron required the partial derivative matrix of its presynaptic vector. Since the
average number of synapses of a neuron in the brain is on the order of n = 1000, each neuron would require
about n^2 = 1,000,000 storage units outside the neuron, about 1000 times its own synapse count. Summed over
all neurons, this extracellular storage would be about 1000 times the total number of synapses (about 10^14) in the brain!
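
The storage argument above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than part of the original model: the function names are hypothetical, and it assumes that an extracellular learner keeps a full n × n moment matrix per neuron, whereas an in-place learner keeps only one weight per synapse.

```python
# Illustrative arithmetic only; the names and the storage model are assumptions.

SYNAPSES_PER_NEURON = 1_000  # n, the order of magnitude used in the text


def extracellular_units(n: int) -> int:
    """Entries of a full n x n moment matrix kept outside the neuron."""
    return n * n


def in_place_units(n: int) -> int:
    """One stored weight per synapse, kept inside the neuron."""
    return n


n = SYNAPSES_PER_NEURON
overhead = extracellular_units(n) // in_place_units(n)
print(extracellular_units(n))  # 1000000
print(overhead)                # 1000
```

Under these assumptions the overhead factor equals n itself, so it does not depend on how many neurons the brain or the electronic system contains.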

Conceptually, the fate and function of a neuron is not determined by a “hand-designed” (i.e., genome 
specified) meaning of the external environment. This is another consequence of the genomic equiva- 
lence principle. The genome in each cell regulates the cell’s mitosis, differentiation, migration, branch- 
ing, and connections, but it does not regulate the meaning of what the cell does when it receives signals 
from other connected cells. For example, we can find a V1 cell (neuron) that responds to an edge of a 
particular orientation. This is just a facet of many emergent properties of the cell that are consequences 
of the cell’s own biological properties and the activities of its environment. A developmental program 
does not need to, and should not, specify which neuron detects a pre-specified feature type (such as an 
edge or motion). 


31.2 Why Autonomous Mental Development? 


One can see that biological development is very “low level,” regulating only individual neurons. Then, 
why is it necessary to enable our complex electronic machines to develop autonomously? Why do we not 
design high-level concepts into the machines and enable them to carry out our high-level directives? In 
fact, this is exactly what many symbolic methods have been doing for many years. Unfortunately, the 
resulting machines are brittle: they fail miserably in the real world when the environment falls outside the
domains that have been modeled by the programmer.

To appreciate what a machine faces in carrying out a complex task, Weng [48] introduced a concept
called task muddiness. The composite muddiness of a task is a multiplicative product of many individual
muddiness measures. There are many possible individual muddiness measures. Those individual muddiness
measures are not necessarily mutually independent or at the same level of abstraction, since such
a requirement is neither practical nor necessary for describing the muddiness of a task. They fall into five
categories: (1) external environment, (2) input, (3) internal environment, (4) output, and (5) goal, as 
shown in Table 31.1. The term “external” means external with respect to the brain and “internal” means 
internal to the brain. 

The composite muddiness of a task can be considered as a product of all individual muddiness 
measures. In other words, a task is extremely muddy when all five categories have a high measure.
A chess-playing task with symbolic input and output is a clean problem because it is low in categories
(1) through (5). A symbolic language translation problem is low in (1), (2), and (4), moderate in (3), but
high in (5). A vision-guided navigation task in a natural human environment is high in (1), (2), (3),
and (5), but moderate in (4). A human adult handles extremely muddy tasks that are high in all five
categories.
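
The multiplicative definition can be illustrated with a short sketch. The source does not fix a numeric scale for the individual measures, so the scale here (1.0 for clean, larger values for muddier) and all scores are made-up assumptions; `composite_muddiness` is a hypothetical helper name.

```python
from math import prod
from typing import Mapping


def composite_muddiness(measures: Mapping[str, float]) -> float:
    """Multiply the individual muddiness measures into one composite value."""
    return prod(measures.values())


# Made-up category scores (assumed scale: 1.0 = clean, larger = muddier).
chess = {"external": 1.0, "input": 1.0, "internal": 1.0, "output": 1.0, "goal": 1.0}
navigation = {"external": 3.0, "input": 3.0, "internal": 3.0, "output": 2.0, "goal": 3.0}

print(composite_muddiness(chess))       # 1.0
print(composite_muddiness(navigation))  # 162.0
```

Because the measures are multiplied, a task that is high in every category receives a far larger composite value than one that is high in only a single category.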

From the muddiness table (Table 31.1), we gain a more detailed appreciation of what a human adult
deals with even in a daily task, e.g., navigating or driving in a city environment. The composite mud- 
diness of many tasks that a human or a machine can execute is proposed by Weng [48] as a metric for 
measuring required intelligence. 


TABLE 31.1 List of Muddiness Factors for a Task 


Category              Factor           Clean        Muddy
External environment  Awareness        Known        Unknown
                      Complexity       Simple       Complex
                      Controlledness   Controlled   Uncontrolled
                      Variation        Fixed        Changing
                      Foreseeability   Foreseeable  Nonforeseeable
Input                 Rawness          Symbolic     Real sensor
                      Size             Small        Large
                      Background       None         Complex
                      Variation        Simple       Complex
                      Occlusion        None         Severe
                      Activeness       Passive      Active
                      Modality         Simple       Complex
                      Multimodality    Single       Multiple
Internal environment  Size             Small        Large
                      Representation   Given        Not given
                      Observability    Observable   Unobservable
                      Imposability     Imposable    Nonimposable
                      Time coverage    Simple       Complex
Output                Terminalness     Low          High
                      Size             Small        Large
                      Modality         Simple       Complex
                      Multimodality    Single       Multiple
Goal                  Richness         Low          High
                      Variability      Fixed        Variable
                      Availability     Given        Unknown
                      Telling-mode     Text         Multimodal
                      Conveying-mode   Simple       Complex


A human infant is not able to perform the muddy tasks that a human adult performs every day. The
process of mental development is necessary to develop such a wide array of mental skills. Much evidence
in developmental psychology has demonstrated that not only is a process of development necessary for
human intelligence, but the environment of the development is also critical for normal development.

Likewise, it is not practical for a human programmer to program a machine to successfully execute 
a muddy task. Computers have done very well for clean tasks, such as playing chess games. But they 
have done poorly in performing muddy tasks, such as visual and language understanding. Enabling a 
machine to autonomously develop task skills in its real task environments is the only approach that has
been proved successful for muddy tasks: every existing higher intelligence that handles muddy tasks
has developed autonomously.


31.3 Paradigm of Autonomous Development 


By definition, an agent is something that senses and acts. A robot is an agent, so is a human. In the 
early days of artificial intelligence, smart systems that caught the general public’s imagination were 
programmed by a set of task-specific rules. The field of artificial intelligence moved beyond that early 
stage when it started the trend of studying general agent methodology [40], although the agent is still a 
task-specific machine. 

As far as we know, Cresceptron (1993) [50,51] was the first developmental model for visual learning
from complex natural backgrounds. By developmental, we mean that the internal representation is fully
emergent from interactions with the environment, without allowing a human to manually instantiate a
task-specific representation. By the mid-1990s, connectionists had started the exploration of the chal- 
lenging domain of development [8,29,37]. 

Because no single researcher or reviewer is likely to command the full breadth and depth of the
multidisciplinary knowledge involved, domain experts have raised various doubts. Examples include the following
assumptions: (1) Artificial intelligence does not need to follow the brain’s way. (2) Modeling the human
mind does not need to follow the brain’s way. (3) Your commitment to understanding the brain is laudable
but naive.

There is a lack of bylaws, guidelines, and due process that contain the negative effects of human 
nature that are well documented by Thomas Kuhn [25]. Such negative effects eroded “revolutionary 
advances” required by some programs. Serious overhauls and investments for the infrastructure for 
converging research on intelligence are urgently needed. Such infrastructure is necessary for the healthy 
development of science and technology in the modern time. 

Not until the birth of the new AMD field marked by the NSF- and DARPA-funded Workshop on 
Development and Learning 2000 [54,55] has the concept of the task-nonspecific developmental program 
caught the attention of researchers. A hallmark difference between traditional artificial intelligence 
approaches and autonomous mental development [54] is the task specificity. All the existing approaches 
to artificial intelligence are task specific, except the developmental approach. Table 31.2 lists the major 
differences among existing approaches to artificial intelligence. An entry marked as “avoid modeling” 
means that the representation is emergent from experience. 


TABLE 31.2 Comparison of Approaches to Artificial Intelligence 


Approach         Species Architecture  World Knowledge       Agent Behaviors    Task Specific
Knowledge-based  Model                 Model                 Model              Yes
Learning-based   Model                 Parametrically model  Model              Yes
Behavior-based   Model                 Avoid modeling        Model              Yes
Genetic          Genetic search        Parametrically model  Model              Yes
Developmental    Parametrically model  Avoid modeling        Minimize modeling  No


FIGURE 31.1 Illustration of the paradigm of developmental agents, inspired by human mental development.
No task is given during the programming (i.e., conception) time, during which a general-purpose task-nonspecific 
developmental program is loaded onto the agent. Prenatal development is used for developing some initial process- 
ing pathways in the brain using spontaneous (internally generated) signals from sensors. After the birth, the agent 
starts to learn an open series of tasks through interactions with the physical world. The tasks that the agent learns 
are determined after the birth. 


Traditionally, given a task to be executed by the machine, it is the human programmer who under- 
stands the task and, based on his understanding, designs a task-specific representation. Depending 
on different approaches, different techniques are used to produce the mapping from sensory inputs to 
effector outputs. The techniques used range from direct programming (knowledge-based approach), 
to learning the parameters (in the parametric model), to genetic search (genetic approach). Although 
genetic search is a powerful method, the chromosome representations used in artificial genetic search 
algorithms are task specific. 

Using the developmental approach, the tasks that the robot (or human) ends up doing are unknown 
during the programming time (or conception time), as illustrated in Figure 31.1. The ecological condi- 
tions that the robot will operate under must be known, so that the programmer can design the body of 
the robot, including sensors and effectors, suited for the ecological conditions. The programmer may 
guess some typical tasks that the robot will learn to perform. However, world knowledge is not modeled 
and only a set of simple reflexes is allowed for the developmental program. During “prenatal” develop- 
ment, internally generated synthetic data can be used to develop the system before birth. For example, 
the retina may generate spontaneous signals to be used for the prenatal development of the visual path- 
way. At the “birth” time, the robot’s power is turned on. The robot starts to interact with its environ- 
ment, including its teachers, in real time. The tasks the robot learns are in the mind of its teachers. In 
order for the later learning to use the skills learned in early learning, a well-designed sequence of educa- 
tional experience is an important practical issue. 


31.4 Learning Types 


In the machine learning literature, there have been widely accepted definitions of learning types, such as 
supervised, unsupervised, and reinforcement learning. However, these conventional definitions are too 
coarse to describe computational learning through autonomous development. For example, it is difficult 
to identify any type of learning that is completely unsupervised. Further, the traditional classification of 
animal learning models, such as classical conditioning and instrumental conditioning, is not sufficient 
to address computational considerations of every time instant of learning. A definition of a refined
classification of learning types is necessary.


TABLE 31.3 Eight Types of Learning


Type (Binary)  Internal State  Effector    Biased Sensor
0 (000)        Autonomous      Autonomous  Communicative
1 (001)        Autonomous      Autonomous  Reinforcement
2 (010)        Autonomous      Imposed     Communicative
3 (011)        Autonomous      Imposed     Reinforcement
4 (100)        Imposable       Autonomous  Communicative
5 (101)        Imposable       Autonomous  Reinforcement
6 (110)        Imposable       Imposed     Communicative
7 (111)        Imposable       Imposed     Reinforcement


We use a variable i to indicate whether an internal task-specific representation is imposed by the human
programmer (internal-state imposed, i = 1) or not (internal-state autonomous, i = 0).

We use e to denote whether the effector is autonomous. If the concerned effector is directly guided by the human
teacher or other teaching mechanisms toward the desired action, we call the situation effector imposed (e = 1).
Otherwise, the learning is effector autonomous (e = 0).

We need to distinguish the channels of reward (e.g., sweet and pain sensors) that are available at
birth from other channels that are not ready to be used as reward at birth (e.g., the
auditory inputs “good” or “bad”) but imply a value after a certain amount of development. We define
(inborn) biased sensors as follows:

If the machine has a predefined preference pattern to the signals from a sensor at the birth time, this 
sensor is an (inborn) biased sensor. Otherwise, it is an (inborn) unbiased sensor. 

In fact, all the sensors become biased gradually through postnatal experience—the development of 
the value system. For example, the image of a flower does not give a newborn baby much reward, but the 
same image becomes pleasant to look at (high value) after the baby has grown up.

We use the third variable b to denote whether a biased sensor is used. If any biased sensor is activated 
(sensed) during the learning, we call the situation reinforcement (b = 1). Otherwise, the learning is
called communicative (b = 0). 

Using these three key factors, any type of learning can be represented by a 3-tuple (i, e, b), which 
contains three components i, e, and b, each of which is either 0 or 1. Thus, there are
a total of eight different 3-tuples, representing a total of eight different learning types. If we consider ieb
as the three binary bits of the type index, we have the eight types of learning defined in
Table 31.3. We can also name each type. For example, Type 0 is state-autonomous, effector-autonomous, 
communicative learning. Type 7 is state-imposable, effector-imposed, reinforcement learning, but it has 
not been included in the traditional definition of either supervised learning or reinforcement learning. 
However, this learning is useful when teaching a positive or negative lesson through supervision. 

Using three key features, state-imposed, effector-imposed and reinforcement, eight learning types 
are defined. This refined definition is necessary for understanding various modes of developmental and
nondevelopmental learning. 

All learning types that use a nondevelopmental learning method correspond to Types 4 to 7,
because the task-specific representation is at least partially handcrafted after the task is given.
Autonomous mental development uses Types 0 to 3.
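
The encoding of Table 31.3 can be sketched in a few lines. This is an illustrative sketch only; the names `LearningType`, `type_index`, `from_index`, `type_name`, and `is_developmental` are hypothetical and do not come from the cited work.

```python
from typing import NamedTuple


class LearningType(NamedTuple):
    i: int  # 1 = internal state imposed by the programmer, 0 = autonomous
    e: int  # 1 = effector imposed by a teacher, 0 = autonomous
    b: int  # 1 = a biased sensor is activated (reinforcement), 0 = communicative


def type_index(lt: LearningType) -> int:
    """Read (i, e, b) as the three binary bits ieb of the type number."""
    return (lt.i << 2) | (lt.e << 1) | lt.b


def from_index(index: int) -> LearningType:
    """Recover the 3-tuple (i, e, b) from a type number between 0 and 7."""
    return LearningType((index >> 2) & 1, (index >> 1) & 1, index & 1)


def type_name(lt: LearningType) -> str:
    """Name a learning type with the wording used in the text."""
    state = "state-imposable" if lt.i else "state-autonomous"
    effector = "effector-imposed" if lt.e else "effector-autonomous"
    sensor = "reinforcement" if lt.b else "communicative"
    return f"{state}, {effector}, {sensor}"


def is_developmental(lt: LearningType) -> bool:
    """Types 0 to 3 (i = 0) are the ones used by autonomous mental development."""
    return lt.i == 0


for k in range(8):
    lt = from_index(k)
    print(k, format(k, "03b"), type_name(lt), is_developmental(lt))
```

The loop prints one row per type in the order of Table 31.3, with the final column marking the four types available to a developmental agent.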


31.5 Developmental Mental Architectures 


Weng [47] proposed a SASE model through which the agent can autonomously learn to think, while the 
thinking behavior is manifested as internal attention. Attention is a key to emergent intelligence. 


31.5.1 Top-Down Attention Is Hard


Consider a car in a complex urban street environment. Attention and recognition are a pair of dual- 
feedback problems. Without attention, recognition cannot do well; recognition requires attended 
areas (e.g., the car area) for the further processing (e.g., to recognize the car). Without recognition, 
attention cannot do well; attention requires recognition for guidance of the next fixation (e.g., a pos- 
sible car area). 


31.5.1.1 Bottom-Up Attention 


Studies in psychology, physiology, and neuroscience have provided qualitative models for bottom-up attention,
i.e., attention that uses different properties of sensory inputs (e.g., color, shape, and illuminance) to
extract saliency. Several models of bottom-up attention have been published. The first explicit computational
model of bottom-up attention was proposed by Koch and Ullman in 1985 [24], in which
a “saliency map” is computed to encode stimulus saliency at every location in the visual scene. More
recently, Itti, Koch, et al. [17] integrated color, intensity, and orientation as basic features in multiple
scales for attention control. An active-vision system, called NAVIS (neural active vision) by Backer et al., 
was proposed to conduct the visual attention selection in a dynamic visual scene [1]. Our SASE model 
to be discussed next indicates that saliency is not necessarily independent of learning: The top-down 
process in the previous time instant may affect the current bottom-up saliency. 


31.5.1.2 Top-Down Attention 


Volitional shifts of attention are also thought to be performed top-down, through spatially defined and
feature-dependent controls. Olshausen et al. [33] proposed a model of how visual attention can be
directed to address the position and scale invariance in object recognition, assuming that the posi- 
tion and size information is available from the top control. Tsotsos et al. [45] implemented a version 
of attention selection using a combination of a bottom-up feature extraction scheme and a top-down 
position selective tuning scheme. Rao and Ballard [39] described a pair of cooperating neural net- 
works, to estimate object identity and object transformations, respectively. Schill et al. [42] presented 
a top-down, knowledge-based reasoning system with a low-level preprocessing where eye movement 
is to maximize the information about the scene. Deco and Rolls [5] proposed a model of object recognition
that incorporates top-down attentional mechanisms on a hierarchically organized set of visual
cortical areas. Among the above studies, the model of Deco and Rolls [5] was probably the most biologically
plausible, as it incorporates bottom-up and top-down flows into individual neuronal computation, but
unfortunately the top-down connections were disabled during learning and no recognition performance
data were reported.

In the Where-What Network 2 (WWN-2) experiment [18] discussed later, we found that the corresponding
network that drops the L4-L2/3 laminar structure gave a recognition rate lower than 50%.
In other words, a network that treats top-down connections the same way as bottom-up connections (like a
uniform liquid state machine [38]) is not likely to achieve an acceptable performance.


31.5.2 Motor Shapes Cortical Areas 


On one hand, high-order (i.e., later) visual cortex of the adult brain includes functionally specific 
regions that preferentially respond to objects, faces, or places. For example, the fusiform face area (FFA) 
responds to face stimuli [11,21], and the parahippocampal place area (PPA) responds to place identity 
[2,7,32]. How does the brain accomplish this feat of localizing internal representation based on mean- 
ing? Why is such a representation necessary? 

In the cerebral cortex, there is a dense web of anatomically prominent feedback (i.e., top-down) con- 
nections [9,20,22,23,35,41]. It has been reported that cortical feedback improves discrimination between 
figure and background and plays a role in attention and memory [12,14,44]. Do feedback connections
perform attention? Furthermore, do feedback connections play a role in developing abstractive internal 
representation? 

The computational roles of feedback connections in developing meaning-based internal representations
have not been clarified in the existing studies reviewed above. The self-abstractive architecture described next
indicates that, in the cerebral cortex, each functional layer (L4 and L2/3) is a state at that layer. We will
show that, unlike the states in POMDPs, HMMs, Hopfield networks, and many others, the states in the
self-abstractive architecture integrate information from bottom-up inputs (feature inputs), lateral inputs
(collaborative context), and top-down inputs (abstract contexts) into a concise continuous vector
representation, without the artificial boundaries of a symbolic representation.


31.5.3 Brain Scale: “Where” and “What” Pathways 


Since the work of Ungerleider and Mishkin (1982) [30,46], a widely accepted description of visual cortical
areas is illustrated in Figure 31.2 [9,33]. A ventral or “what” stream that runs from V1, to V2, V4, and IT 
areas TEO and TE computes properties of object identity such as shape and color. A dorsal or “where” 
stream that runs from V1, to V2, V3, MT, and the medial superior temporal areas MST, and on to the 
posterior parietal cortex (PP) computes properties of the location of the stimulus on the retina or with 
respect to the animal’s head. Neurons in early visual areas have small spatial receptive fields (RFs) and 
code basic image features; neurons in later areas have large RFs and code abstract features such as behav- 
ioral relevance. Selective attention coordinates the activity of neurons to affect their competition and 
link distributed object representations to behaviors (e.g., see the review by Serences and Yantis [43]).

With the above rich, suggestive information from neuroscience, I propose that the development of the 
functions of the “where” and “what” pathways is largely due to the following: 


1. Downstream motors. The motor ends of the dorsal pathway that perform position tasks (e.g., stretching
an arm to reach for an apple or a tool), and the motor ends of the ventral pathway that perform
type classification and conceptual tasks (e.g., different limbic needs between a food and an enemy).

2. Top-down connections. The top-down connections from motor areas that shape the correspond- 
ing pathway representations. 


FIGURE 31.2 (a) How does the brain generate internal representation? The only external sources are sensors
and effectors. The imaginary page slices the brain to “peek” into its internal representation. (b) The dorsal “where”
pathway and the ventral “what” pathway. The nature of the processing along each pathway is shaped by not only
sensory inputs but also the motor outputs. Abbreviations in the figure: MT, middle temporal; PP, posterior
parietal; LIP, lateral intraparietal; IT, inferior temporal.


FIGURE 31.3 The system diagram: multi-sensory and multi-effector integration through learning.


In short, motor is abstract. Any meaning that can be communicated between humans is
motorized: spoken, written, hand-signed, etc. Of course, “motor is abstract” does not mean that every 
stage of every motor action sequence is abstract. However, the sequences of motor actions provide sta- 
tistically crucial information for the development of internal abstractive representation. 


31.5.4 System 


The system level architecture is illustrated in Figure 31.3. 
An agent, either biological or artificial, can perform regression and classification. 


Regression: The agent takes a vector as input (a set of receptors). For vision, the input vector corresponds to 
a retinal image. The output of the network corresponds to motor signals, in which multiple components may be
active (firing). The brain is a very complex spatiotemporal regressor.


Classification: The agent can perform classification before it has developed sophisticated human lan- 
guage capability to verbally tell us the name of a class. For example, each neuron in the output layer 
corresponds to a different class. 
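
As a small illustration of the classification case, the class can be read out as the label of the most active output neuron. This is a sketch under the assumption of one output neuron per class; the function name `classify` and the example labels are hypothetical.

```python
from typing import Sequence


def classify(responses: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label of the output neuron with the largest response."""
    if len(responses) != len(labels):
        raise ValueError("one label is needed per output neuron")
    winner = max(range(len(responses)), key=lambda j: responses[j])
    return labels[winner]


print(classify([0.1, 0.7, 0.2], ["cat", "dog", "car"]))  # dog
```

In the regression case the whole response vector is used as the motor signal, so no such winner selection is applied.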


31.5.4.1 Two Signal Sources: Sensor and Motor 


The brain faces a major challenge as shown in Figure 31.2a. It does not have the luxury of having a 
human teacher to implant symbols into it, as the brain is not accessible directly to the external human 
teacher. Thus, it must generate internal representations from the two signal sources: the sensors and the 
effectors (motors). This challenging goal is accomplished by the brain’s where-what networks schemati- 
cally illustrated in Figure 31.4. The system has two motor areas, the where motor that indicates where 
the attended object is and the what motor that tells what the attended object is. This specialization of 
each pathway makes computation of internal representation more effective. 


31.5.5 Pathway Scale: Bottom-Up and Top-Down 


It is known that cortical regions are typically interconnected in both directions [4,9,57]. However, com- 
putational models that incorporate both bottom-up and top-down connections have resisted full anal- 
ysis [3,5,10,16,19,26,31]. The computational model, illustrated in Figure 31.5, provides further details 
about how each functional level in cortex takes inputs from the bottom-up signal representation space 
X and top-down signal representation space Z to generate and update self-organized cortical bridge 
representation space Y. This model further computationally predicts that a primary reason for the dorsal 
and ventral pathways to be able to deal with “where” and “what” (or achieving identity and positional 
invariances [19]), respectively, is that they receive top-down signals that drive their motors. 

From where does the forebrain receive teaching signals that supervise its motors? Such supervised-motor
signals can be generated either externally (e.g., a child passively learns writing while his teacher
manually guides his hand) or internally (e.g., from the trials generated by the spinal cord or the
midbrain). As illustrated in Figure 31.4, the model indicates that from early to later cortical areas, the
neurons gradually increase their receptive fields and gradually reduce their effective fields as the
processing of the corresponding bridge representations becomes less sensory and more motoric.


FIGURE 31.4 A schematic illustration of the visual where-what networks (WWNs). Features in the drawing are
for intuition only since they are developed by the dually optimal LCA instead. V1 neurons have small receptive
fields, which represent local features. Top-down connections enable V1 to recruit more neurons for action-related
features. V2 is similar to V1 but its neurons have larger receptive fields. They are both type specific and location
specific, because they receive top-down information from both pathways. The pre-where-motor area in the dorsal
pathway receives two sources of inputs, bottom-up from V2 and top-down from the “where” motor. Learning
enables neurons to group according to position, to become less sensitive to type (or type quasi-invariant). In our
WWN-2 experiments, the pre-where-motor neurons became almost positionally pure: each neuron responds
only to a single retinal location. In the where-motor area, learning enables each neuron to link to premotor neurons
that fired with it (when it was supervised), serving a logic-OR type of function. Thus, the where-motor becomes
totally type-invariant but position specific. The pre-what-motor area in the ventral pathway receives very different
top-down signals from the what motor (in contrast with the pre-where-motor) to become less sensitive to
position (or position quasi-invariant). The what motor becomes totally positionally invariant but type specific.
There are multiple firing neurons in V2 at any time, some responding to the foreground and some responding to
the background. Each pre-motor area enables global competition, causing only foreground neurons to win (without
top-down supervision) or attended neurons to win (with top-down supervision). Therefore, although background
pixels cause V1 neurons and V2 neurons to fire, their signals cannot pass the two premotor areas. The experimental
results are available in Ji and Weng [18] for WWN-2 and Luciw and Weng [27] for WWN-3.


31.5.6 Cortex Scale: Feature Layers and Assistant Layers 


The cerebral cortex contains six layers: layer L1 is the superficial layer and layer L6 is the deep layer.
Weng et al. [53] reasoned that L4 and L2/3 are two feature detection layers, as shown in Figure 31.5,
with L5 assisting L2/3 and L6 assisting L4, in the sense of enabling long-range lateral inhibition. Such
long-range inhibitions encourage different neurons to detect different features. The model illustrated in
Figure 31.5 was informed by the work of Felleman and Van Essen [9], Callaway and coworkers [4,57],
and others (e.g., [12]). There are no top-down connections from L2/3 to L4, indicating that L4 uses
unsupervised learning (U) while L2/3 uses supervised learning (S). Weng et al. [53] reported that such a
paired hierarchy USUS led to better recognition rates than the unpaired SSSS alternative.


FIGURE 31.5 Cortex scale: The spatial SASE network for both spatial processing and temporal processing with-
out dedicated temporal components. At each temporal unit shown above (two time frames), three basic operations 
are possible: link, drop prefix, and drop postfix. After proper training, the TCM is able to attend any possible tem- 
poral context up to the temporal sampling resolution. 


31.5.6.1 Three-Source Attentive Spatiotemporal (TAS) Model 


Sequential attentions (i.e., a cortical mechanism of thinking [47]) must deal with temporal contexts.
How the brain treats time has been largely a mystery [6,15,28]. The following cortex-inspired spatiotemporal
model is a new theory. It enables adaptive, arbitrary temporal lengths without using a fixed
temporal window length as iconic memory (e.g., the brain has only one retina!). Bottom-up and top-down
two-way connections mean mutual dependency, which requires iterations to reach an “attractor
valley” [13] if the model is for a one-shot solution. However, the brain is for sequential decisions instead,
as illustrated in Figure 31.5. Thus, convergence is not guaranteed in general while the brain thinks.
Specifically, level L2/3, based on its own current content L(t - 1), takes three signal sources: bottom-up
input x(t - 1) as lower features, lateral input y(t - 1) as the last temporal context, and top-down input
z(t - 1) as attention, all at time t - 1, through the cortical function modeled as lobe component
analysis (LCA) [52,56], which generates its response y(t) at time t as the attention-selected context and
updates its level to L(t):


$$(y(t), L(t)) = \mathrm{Cortex}_{\mathrm{LCA}}\bigl(x(t-1),\, y(t-1),\, z(t-1) \mid L(t-1)\bigr) \qquad (31.1)$$


We call this process attentive context folding. The response vector y(t) from L2/3 is used as more abstract 
features (more motoric) for the next higher level but not as the top-down input for the lower L4. The 
L4 context folding is analogous to L2/3, except that it does not take top-down input and its response 
is also used as top-down input to L2/3 in the lower cortex, as illustrated in Figure 31.5. The absence 
of top-down flow to L4 reduces the undesirable top-down hallucination that has shown unacceptable 
experimental results (lower than 50% recognition rate). 
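
The data flow of Equation 31.1 can be sketched as a loop in which the response at time t is computed only from signals available at time t − 1. The sketch below is a toy illustration and not the LCA algorithm of [52,56]: the function names `toy_cortex` and `run`, the single-winner response, and the fixed update rate of 0.1 are all hypothetical stand-ins chosen only to show how x, y, and z are folded into the next context.

```python
import numpy as np

def toy_cortex(x_prev, y_prev, z_prev, L_prev):
    """Hypothetical stand-in for Cortex_LCA in Eq. 31.1.

    L_prev holds one weight row per neuron over the concatenated
    input (x, y, z). Returns the new response y(t) and level L(t).
    """
    p = np.concatenate([x_prev, y_prev, z_prev])  # three signal sources
    r = L_prev @ p                                # match of each neuron to the input
    j = int(np.argmax(r))                         # single winner (assumption)
    y = np.zeros(L_prev.shape[0])
    y[j] = 1.0                                    # attention-selected context
    L = L_prev.copy()
    L[j] += 0.1 * (p - L[j])                      # placeholder update of the winner only
    return y, L

def run(xs, zs, L0):
    """Fold a sequence of bottom-up inputs xs and top-down inputs zs."""
    y = np.zeros(L0.shape[0])                     # no temporal context at the start
    L = L0
    responses = []
    for x_prev, z_prev in zip(xs, zs):
        y, L = toy_cortex(x_prev, y, z_prev, L)   # y(t) becomes lateral input at t + 1
        responses.append(y)
    return responses, L
```

Because the previous response is fed back as the lateral input, each new response depends on the whole history of inputs without any fixed temporal window.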


31.5.6.2 Sequential Decisions 


In sequential attentions, an outcome depends on multiple attention decisions in sequence, as each 
decision depends on the outcomes of all the related previous decisions. For example, the motor output at 
time t is affected by the top-down attention signals from the previous motor output at time t − 2. Figure 
31.6 describes an example of how the attentive “machine” recursively makes sequential decisions: it 
generates different top-down attentions and bottom-up attention-abstracted features at different times. 
Each top-down context directs the cortical region (L2/3-L4 combination) to attend to a different part 
(or feature set) of the bottom-up input. This leads to a sequence of different top-down attentions and a 
sequence of different cortical responses, regardless of whether the bottom-up input (or the environment) 
changes or not. 


FIGURE 31.6 A schematic illustration of sequential decision making using top-down context (i.e., attention), 
while the sensory inputs (A, B, etc.) flow in one at a time. Two examples are presented here: Linking two sensory 
inputs (upper) and dropping a sensory input (lower). It has been proved [49] that any subset of the past context can 
be formed at the motor cortex, depending on the effectiveness of learning. 


31.5.7 Level Scale: Dually Optimal CCI LCA 


As shown in Figure 31.6, given the parallel input space consisting of the bottom-up space X and the 
top-down input space Z, represented as X × Z, the major developmental goal of each cortical level (L4 or L2/3 
in Figure 31.5) is to have different neurons in the level detect different features, while nearby neurons 
detect similar features. 

Each feature level faces two pairs of conflicting criteria which are probably implicit during biological 
evolution: (1) The spatial pair: with its limited number of neurons, the level must learn the best internal 
representation from the environment while keeping a stable long-term memory. (2) The spatiotemporal 
pair: with its limited child time for learning, the level must not only learn the best representation but 
also learn quickly without forgetting important mental skills acquired a long time ago. The sparse coding 
principle [34] is useful to address the first pair: allowing only a few neurons (the best matched) to fire and 
update. Other neurons in the level serve as long-term memory because they are not affected. In other words, 
in each cortical region, only closely related mental skills are replaced each time. Therefore, the role of 
each neuron as working memory or long-term memory is dynamic, depending on the feature match (i.e., 
binding) with the input, as shown in Figure 31.7. However, this rough idea is not sufficient for optimality. 
FIGURE 31.7 (a) The default connection pattern of every neuron in cortical layer L2/3 and L4 (no Z input for L4). 
The connections are local but two-way. (b) For each neuron in a layer, near neurons are connected to the neuron 
by excitatory connections (for layer smoothness) and far neurons are connected to the center neuron by inhibitory 
connections (competition resulting in detection of different features by different neurons). The upper layer indicates 
the positions for the neurons in the same layer: firing neurons are (context-dependent) working memory and those 
that do not fire are (context-dependent) long-term memory. The lower layer indicates the very high dimensional input 
space of the cortical layer (X × Z) but illustrated in 2-D. The shaded area indicates the manifold of the input 
distribution. The connection curve from the upper neuron and lower small circle indicates the correspondence between 
the upper neuron and the feature that it detects. The neuronal weight vectors must quickly move to this manifold 
as the inputs are received and further the density of the neurons in the shaded area should reflect the density of the 
input distribution. The challenge of fast adaptation at various maturation stages of development: The updating 
trajectory of every neuron is a highly nonlinear trajectory. The statistical efficiency theory for neuronal weight update 
(amnesic average) results in the nearly minimum error in each age-dependent update, meaning not only the 
direction of each update is nearly optimal, but also every step length. 
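
The sparse-coding idea described above can be sketched in a few lines: only the k best-matched neurons fire and update, and all other weight vectors are left untouched, so they act as long-term memory for this input. The function name `sparse_update`, the top-k rule, and the fixed learning rate `lr` are illustrative assumptions, not the CCI LCA update itself.

```python
import numpy as np

def sparse_update(V, p, k=1, lr=0.1):
    """Update only the k neurons whose normalized weights best match p.

    V: (c, d) array of weight vectors (assumed nonzero), p: (d,) input.
    Returns the updated copy of V and the indices of the neurons that fired.
    """
    r = (V @ p) / np.linalg.norm(V, axis=1)   # response of every neuron
    fired = np.argsort(r)[-k:]                # k best-matched neurons
    V_new = V.copy()
    # Hebbian-style move toward the response-weighted input (fixed rate, assumption).
    V_new[fired] += lr * (r[fired, None] * p - V[fired])
    return V_new, fired
```

Which neurons play the role of working memory therefore changes from input to input, since `fired` depends on the match with the current p.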


The cortex-inspired candid incremental covariance-free (CCI) LCA [53,56] has the desired dual 
optimality: spatial and spatiotemporal, as illustrated in Figure 31.7. CCI LCA models optimal 
self-organization by a cortical level with a limited resource: c neurons. The cortical level takes two parallel input 
spaces: the bottom-up space X and the top-down space Z, denoted as P = X × Z, as illustrated by Figure 31.5. 
Each input vector is then denoted as p = (x, z), where x ∈ X and z ∈ Z. CCI LCA computes c feature 
vectors $v_1, v_2, \ldots, v_c$. Associated with these c feature vectors is a partition of the input space P into c disjoint 
regions $R_1, R_2, \ldots, R_c$, so that the input space P is the union of all these regions. For the optimal 
distribution of neuronal resource, we consider that each input vector p is represented by the winner feature 
$v_j$, which has the highest response $r_j$: 


$$j = \arg\max_{1 \le i \le c} r_i,$$


where $r_i$ is the projection of input p onto the normalized feature vector $v_i$: $r_i = p \cdot (v_i / \|v_i\|)$. 
The form of approximation of p is represented by $\hat{p} = r_j v_j / \|v_j\|$, and the error of this representation for 
p is $e(p) = \|\hat{p} - p\|$. 
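
As a concrete illustration of the winner selection and representation error just defined, the following minimal sketch computes the responses, the winner index, the approximation of p, and its error for a small set of feature vectors. The function name `represent` and the array layout (one feature vector per row) are assumptions made for the example.

```python
import numpy as np

def represent(p, V):
    """Represent input p by the winner among the feature vectors in V.

    V: (c, d) array with one feature vector per row, p: (d,) input.
    Returns (winner index j, approximation p_hat, error e(p)).
    """
    units = V / np.linalg.norm(V, axis=1, keepdims=True)  # v_i / ||v_i||
    r = units @ p                                         # r_i = p . v_i / ||v_i||
    j = int(np.argmax(r))                                 # winner has the highest response
    p_hat = r[j] * units[j]                               # r_j v_j / ||v_j||
    err = np.linalg.norm(p_hat - p)                       # e(p) = ||p_hat - p||
    return j, p_hat, err
```

For V with rows (2, 0) and (0, 3) and p = (3, 4), the responses are 3 and 4, so the second vector wins, the approximation is (0, 4), and the error is 3.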
31.5.7.1 Spatial Optimality 


The spatial optimality requires that the spatial resource distribution in the cortical level is optimal in 
minimizing the representational error. For this optimality, the cortical-level developmental program 
modeled by CCI LCA computes the best feature vectors $V = (v_1, v_2, \ldots, v_c)$ so that the expected square 
approximation error $\|\hat{p}(V) - p\|^2$ is statistically minimized: 


$$V^* = (v_1^*, v_2^*, \ldots, v_c^*) = \arg\min_{V} E\,\|\hat{p}(V) - p\|^2, \qquad (31.2)$$


where E denotes statistical expectation. The minimum error means the optimal allocation of limited 
neuronal resource: frequent experience is assigned more neurons (e.g., human face recognition), 
whereas rare experience is assigned fewer neurons (e.g., flower recognition for a nonexpert). This 
optimization problem must be computed incrementally, because the brain receives sensorimotor experience 
incrementally. As the feature vectors are incrementally updated from experience, the winner neurons 
for the past inputs are not necessarily the same if past inputs are fed into the brain again (e.g., parents’ 
speech when their baby was little is heard again by the grown-up baby). However, as the feature 
vectors are stabilized through extensive experience, the partition of the input space also becomes stable. 
Given a fixed partition, it has been proved that the best feature set $V^*$ consists of the c local first principal 
component vectors, one for each region $R_i$. The term “local” means that the principal component 
vector for region $R_i$ only considers the samples that fall into region $R_i$. As the partition is tracking a slowly 
changing environment (e.g., while the child grows up), the optimal feature set $V^*$ tracks the slowly 
changing input distribution (called a nonstationary random process). 
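
For a fixed partition, the local first principal components can be computed in batch form as a reference point for the incremental algorithm. The sketch below assumes that the region of each sample is already known and uses the uncentered second-moment matrix of each region, which is the matrix whose leading eigenvector minimizes the expected squared error of the one-vector approximation defined earlier; the function name `local_first_pcs` and the `labels` array are hypothetical.

```python
import numpy as np

def local_first_pcs(P, labels, c):
    """First principal component of the samples in each of c regions.

    P: (n, d) samples, labels: (n,) region index of each sample in [0, c).
    Returns a (c, d) array of unit vectors (zero rows for empty regions).
    """
    V = np.zeros((c, P.shape[1]))
    for i in range(c):
        S = P[labels == i]                 # only samples that fall into region i
        if len(S) == 0:
            continue
        M = S.T @ S / len(S)               # uncentered second moment (assumption)
        _, U = np.linalg.eigh(M)           # eigenvalues in ascending order
        V[i] = U[:, -1]                    # eigenvector of the largest eigenvalue
    return V
```

The sign of each returned vector is arbitrary, so comparisons against a known direction should use the absolute value of the inner product.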

Intuitively speaking, the spatial optimality means that with the same cortical size, all the children 
will eventually perform at the best level allowed by the cortical size. However, to reach the same mental 
skill level one child may require more teaching than another. The spatiotemporal optimality is deeper. It 
requires the best performance for every time t. That is, the child learns as quickly as the cortical 
size allows at every stage of development. 


31.5.7.2 Temporal Optimality 


The spatiotemporal optimality gives optimal step sizes of learning. Each neuron takes the response-weighted 
input $u(t) = r(t)\,x(t)$ at time t (i.e., the Hebbian increment). From the mathematical theory of statistical 
efficiency, CCI LCA determines the optimal feature vectors $V^*(t) = (v_1^*(t), v_2^*(t), \ldots, v_c^*(t))$ for every time 
instant t starting from the conception time t = 0, so that the distance from $V^*(t)$ to its target $V^*$ is 
minimized: 


$$V^*(t) = \arg\min_{V(t)} E\,\|V(t) - V^*\|^2. \qquad (31.3)$$


CCI LCA aims at this deeper optimality: the smallest average error from the starting time (birth of the 
network) up to the current time t, among all the possible estimators, under some regularity conditions. 
A closed-form solution was found that automatically gives the optimal retention rate and the optimal 
learning rate (i.e., step size) at each synaptic update [56]. 
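
The role of the retention rate and the learning rate can be pictured with an age-dependent incremental average. In the sketch below both rates depend only on the update age n of the neuron; with the amnesic parameter `mu` set to zero they reduce to (n − 1)/n and 1/n, so the weight vector equals the plain average of all Hebbian increments seen so far. The weighting with `mu` follows the amnesic-average form as we understand it from the LCA literature and should be checked against [52,56]; the function name `amnesic_update` is hypothetical.

```python
import numpy as np

def amnesic_update(v, u, n, mu=0.0):
    """One age-dependent update of weight vector v with increment u = r * p.

    n is the update age of the neuron (1 for its first update).
    mu >= 0 gives recent increments extra weight (assumed amnesic form).
    """
    retention = (n - 1 - mu) / n   # weight kept from the old estimate
    learning = (1 + mu) / n        # step size for the new increment
    return retention * v + learning * u
```

The two rates always sum to one, so no separate step-size tuning is needed: the step length shrinks automatically as the neuron accumulates updates.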

In summary, the spatial optimality leads to the Hebbian incremental direction: response-weighted 
presynaptic activity (rp). The deeper spatiotemporal optimality leads to the best learning rates, 
automatically determined by the update age of each neuron. This is like different racers racing on rough terrain 
along self-determined trajectories toward an unknown target. The spatially optimal racers, guided by 
Hebbian directions, do not know step sizes. Thus, they cover other trajectories that require more steps. 
The spatiotemporally optimal racer, CCI LCA, correctly estimates not only the optimal direction at 
every step, as illustrated in Figure 31.7, but also the optimal step size at every step. In our experiments, 
CCI LCA outperformed the self-organizing map (SOM) algorithm by an order of magnitude (over 10 times) in 
terms of percentage distance covered from the initial estimate to the target. This work also predicts a 
cell-age-dependent plasticity schedule, which needs to be verified biologically. 


31.6 Summary 


The material in this chapter outlines a series of tightly intertwined breakthroughs recently made in 
understanding and modeling how the brain develops and works. The grand picture of the human brain 
is becoming increasingly clear. Developmental robots and machines urgently need industrial electronics 
for real-time, brain-scale computation and learning. This need is here and now. It has created a great 
challenge for the field of industrial electronics, but an exciting future as well. 


References 


1. G. Backer, B. Mertsching, and M. Bollmann. Data- and model-driven gaze control for an 
active-vision system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1415-1429, 
December 2001. 

2. V. D. Bohbot and S. Corkin. Posterior parahippocampal place learning in H.M. Hippocampus, 
17:863-872, 2007. 

3. T. J. Buschman and E. K. Miller. Top-down versus bottom-up control of attention in the prefrontal 
and posterior parietal cortices. Science, 315:1860-1862, 2007. 

4. E. M. Callaway. Local circuits in primary visual cortex of the macaque monkey. Annual Review of 
Neuroscience, 21:47-74, 1998. 

5. G. Deco and E. T. Rolls. A neurodynamical cortical model of visual attention and invariant object 
recognition. Vision Research, 40:2845-2859, 2004. 

6. P. J. Drew and L. F. Abbott. Extending the effects of spike-timing-dependent plasticity to 
behavioral timescales. Proceedings of the National Academy of Sciences of the United States of America, 
103(23):8876-8881, 2006. 

7. A. Ekstrom, M. Kahana, J. Caplan, T. Fields, E. Isham, E. Newman, and I. Fried. Cellular networks 
underlying human spatial navigation. Nature, 425:184-188, 2003. 

8. J. L. Elman, E. A. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. Rethinking 
Innateness: A Connectionist Perspective on Development. MIT Press, Cambridge, MA, 1997. 

9. D. J. Felleman and D. C. Van Essen. Distributed hierarchical processing in the primate cerebral 
cortex. Cerebral Cortex, 1:1-47, 1991. 

10. M. D. Fox, M. Corbetta, A. Z. Snyder, J. L. Vincent, and M. E. Raichle. Spontaneous neuronal 
activity distinguishes human dorsal and ventral attention systems. Proceedings of the National Academy 
of Sciences of the United States of America, 103(26):10046-10051, 2006. 

11. K. Grill-Spector, N. Knouf, and N. Kanwisher. The fusiform face area subserves face perception, not 
generic within-category identification. Nature Neuroscience, 7(5):555-562, 2004. 

12. S. Grossberg and R. Raizada. Contrast-sensitive perceptual grouping and object-based attention in 
the laminar circuits of primary visual cortex. Vision Research, 40:1413-1432, 2000. 

13. G. E. Hinton. Learning multiple layers of representation. Trends in Cognitive Science, 
11(10):428-434, 2007. 

14. J. M. Hupe, A. C. James, B. R. Payne, S. G. Lomber, P. Girard, and J. Bullier. Cortical feedback 
improves discrimination between figure and background by v1, v2 and v3 neurons. Nature, 
394:784-787, August 20, 1998. 

15. I. Ito, R. C. Ong, B. Raman, and M. Stopfer. Sparse odor representation and olfactory learning. 
Nature Neuroscience, 11(10):1177-1184, 2008. 

16. L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 
2:194-203, 2001. 

17. L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, November 1998. 

18. Z. Ji and J. Weng. WWN-2: A biologically inspired neural network for concurrent visual 
attention and recognition. In Proceedings of the IEEE International Joint Conference on Neural Networks, 
Barcelona, Spain, July 18-23, 2010. 




19. Z. Ji, J. Weng, and D. Prokhorov. Where-what network 1: “Where” and “What” assist each other 
through top-down connections. In Proceedings of the IEEE International Conference on Development 
and Learning, Monterey, CA, Aug. 9-12, 2008, pp. 61-66. 

20. R. R. Johnson and A. Burkhalter. Microcircuitry of forward and feedback connections within rat 
visual cortex. Journal of Comparative Neurology, 368(3):383-398, 1996. 

21. N. Kanwisher, D. Stanley, and A. Harris. The fusiform face area is selective for faces not animals. 
NeuroReport, 10(1):183-187, 1999. 

22. L. C. Katz and E. M. Callaway. Development of local circuits in mammalian visual cortex. Annual 
Review of Neuroscience, 15:31-56, 1992. 

23. H. Kennedy and J. Bullier. A double-labelling investigation of the afferent connectivity to cortical 
areas v1 and v2 of the macaque monkey. Journal of Neuroscience, 5(10):2815-2830, 1985. 

24. C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. 
Human Neurobiology, 4:219-227, 1985. 

25. T. S. Kuhn. The Structure of Scientific Revolutions, 2nd edn. University of Chicago Press, Chicago, 
IL, 1970. 

26. T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical 
Society of America A, 20(7):1434-1448, 2003. 

27. M. Luciw and J. Weng. Where What Network 3: Developmental top-down attention with multi- 
ple meaningful foregrounds. In Proceedings of the IEEE International Joint Conference on Neural 
Networks, Barcelona, Spain, July 18-23, 2010. 

28. M. D. Mauk and D. V. Buonomano. The neural basis of temporal processing. Annual Review of 
Neuroscience, 27:307-340, 2004. 

29. J. L. McClelland. The interaction of nature and nurture in development: A parallel distributed pro- 
cessing perspective. In P. Bertelson, P. Eelen, and G. d’Ydewalle (eds.), International Perspectives on 
Psychological Science. Leading Themes Vol. 1, Erlbaum, Hillsdale, NJ, 1994, pp. 57-88. 

30. M. Mishkin, L. G. Ungerleider, and K. A. Macko. Object vision and space vision: Two cortical 
pathways. Trends in Neurosciences, 6:414-417, 1983. 

31. J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate cortex. 
Science, 229(4715):782-784, 1985. 

32. J. O'Keefe and J. Dostrovsky. The hippocampus as a spatial map: Preliminary evidence from unit 
activity in the freely-moving rat. Brain Research, 34(1):171-175, 1971. 

33. B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A neurobiological model of visual attention 
and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 
13(11):4700-4719, 1993. 

34. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a 
sparse code for natural images. Nature, 381:607-609, June 13, 1996. 

35. D. J. Perkel, J. Bullier, and H. Kennedy. Topography of the afferent connectivity of area 17 of the 
macaque monkey. Journal of Comparative Neurology, 253(3):374-402, 1986. 

36. W. K. Purves, D. Sadava, G. H. Orians, and H. C. Heller. Life: The Science of Biology, 7th edn. Sinauer, 
Sunderland, MA, 2004. 

37. S. Quartz and T. J. Sejnowski. The neural basis of cognitive development: A constructivist manifesto. 
Behavioral and Brain Sciences, 20(4):537-596, 1997. 

38. M. Rabinovich, R. Huerta, and G. Laurent. Transient dynamics for neural processing. Science, 
321:48-50, 2008. 

39. R. P. N. Rao and D. H. Ballard. Probabilistic models of attention based on iconic representations and 
predictive coding. In L. Itti, G. Rees, and J. Tsotsos (eds.), Neurobiology of Attention. Academic Press, 
New York, 2004. 

40. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Upper Saddle 
River, NJ, 1995. 




41. P. A. Salin and J. Bullier. Corticocortical connections in the visual system: Structure and function. 
Physiological Reviews, 75(1):107-154, 1995. 

42. K. Schill, E. Umkehrer, S. Beinlich, G. Krieger, and C. Zetzsche. Scene analysis with saccadic eye 
movements: Top-down and bottom-up modeling. Journal of Electronic Imaging, 10(1):152-160, 2001. 

43. J. T. Serences and S. Yantis. Selective visual attention and perceptual coherence. Trends in Cognitive 
Sciences, 10(1):38-45, 2006. 

44. T. J. Sullivan and V. R. de Sa. A model of surround suppression through cortical feedback. Neural 
Networks, 19:564-572, 2006. 

45. J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention 
via selective tuning. Artificial Intelligence, 78:507-545, 1995. 

46. L. G. Ungerleider and M. Mishkin. Two cortical visual systems. In D. J. Ingle (ed.), Analysis of Visual 
Behavior. MIT Press, Cambridge, MA, 1982, pp. 549-586. 

47. J. Weng. On developmental mental architectures. Neurocomputing, 70(13-15):2303-2323, 2007. 

48. J. Weng. Task muddiness, intelligence metrics, and the necessity of autonomous mental 
development. Minds and Machines, 19(1):93-115, 2009. 

49. J. Weng. The SWWN connectionist models for the cortical architecture, spatiotemporal 
representations and abstraction. In Proceedings of the Workshop on Bio-Inspired Self-Organizing Robotic 
Systems, IEEE International Conference on Robotics and Automation, Anchorage, AK, May 3-8, 
2010. 

50. J. Weng, N. Ahuja, and T. S. Huang. Learning recognition and segmentation of 3-D objects from 
2-D images. In Proceedings of the IEEE Fourth International Conference on Computer Vision, Berlin, 
Germany, May 1993, pp. 121-128. 

51. J. Weng, N. Ahuja, and T. S. Huang. Learning recognition and segmentation using the Cresceptron. 
International Journal of Computer Vision, 25(2):109-143, November 1997. 

52. J. Weng and M. Luciw. Dually optimal neuronal layers: Lobe component analysis. IEEE Transactions 
on Autonomous Mental Development, 1(1):68-85, 2009. 

53. J. Weng, T. Luwang, H. Lu, and X. Xue. Multilayer in-place learning networks for modeling 
functional layers in the laminar cortex. Neural Networks, 21:150-159, 2008. 

54. J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous 
mental development by robots and animals. Science, 291(5504):599-600, 2001. 

55. J. Weng and I. Stockman. Autonomous mental development: Workshop on development and 
learning. AI Magazine, 23(2):95-98, 2002. 

56. J. Weng and N. Zhang. Optimal in-place learning and the lobe component analysis. In Proceedings 
of the IEEE World Congress on Computational Intelligence, Vancouver, Canada, July 16-21, 2006. 

57. A. K. Wiser and E. M. Callaway. Contributions of individual layer 6 pyramidal neurons to local 
circuitry in macaque primary visual cortex. Journal of Neuroscience, 16:2724-2739, 1996. 




32 


Synthetic Biometrics for Testing Biometric Systems and User Training 


Svetlana N. Yanushkevich, University of Calgary 
Adrian Stoica, Jet Propulsion Laboratory 
Ronald R. Yager, Iona College 
Oleg Boulanov, University of Calgary 
Vlad P. Shmerko, University of Calgary 


32.1 Introduction 
32.2 Synthetic Biometrics 
     Synthetic Fingerprints • Synthetic Iris and Retina Images • Synthetic Signatures 
32.3 Example of the Application of Synthetic Biometric Data 
     Hyperspectral Facial Analysis and Synthesis in Decision-Support Assistant • Hyperspectral 
     Analysis-to-Synthesis 3D Face Model 
32.4 Synthetic Data for User Training in Biometric Systems 
References 


32.1 Introduction 


Synthetic biometrics are understood as generated biometric data that are biologically meaningful for 
existing biometric systems. These synthetic data replicate possible instances of otherwise unavailable data, 
in particular corrupted or distorted data. For example, facial images, acquired by video cameras, can be 
corrupted due to their position and angle of observation (appearance variation), as well as lighting (envi- 
ronmental conditions), camera resolution, and other parameters (measurement conditions). The other 
reason for the use of synthetic data is the difficulty in collecting a statistically meaningful amount of bio- 
metric samples due to privacy issues and the unavailability of large databases, etc. In order to avoid these 
difficulties, synthetic biometric data can be used as samples, or tests, generated using the controllability 
of various parameters. This renders them capable of being used to test biometric tools and devices [17,28]. 

Synthetic biometric data can also be thought of in terms of a forgery of biometric data. Properly created 
artificial biometric data provide an opportunity for the detailed and controlled modeling of a wide 
range of training skills, strategies, and tactics, thus enabling better approaches to enhancing system 
performance. 

Contemporary techniques and achievements in biometrics are being developed in two directions: 
toward the analysis of biometric information (direct problems) and toward the synthesis of biometric 
information (inverse problems) [1,6,11,33,34] (Figure 32.1). 

The crucial point of modeling in biometrics (inverse problems) is the analysis-by-synthesis paradigm. 
This paradigm states that synthesis of biometric data can verify the perceptual equivalence between 
original and synthetic biometric data, i.e., synthesis-based feedback control. For example, facial analysis 
can be formulated as deriving a symbolic description of a real facial image. The aim of facial synthesis is 
to produce a realistic facial image from a symbolic facial expression model. 


FIGURE 32.1 Direct (a) and inverse (b) problems of biometrics. 


32.2 Synthetic Biometrics 


In this section, examples of synthetic biometrics, such as synthetic fingerprints, iris, retina, and signa- 
tures, are introduced. 


32.2.1 Synthetic Fingerprints 


Today’s interest in automatic fingerprint synthesis addresses the urgent problems of testing fingerprint 
identification systems, training security personnel, enhancing biometric database security, and protect- 
ing intellectual property [8,18,33]. 

Traditionally, two methods for fingerprint imitation are discussed with respect to obtaining unau- 
thorized access to an information system: (1) the authorized user provides his/her fingerprint for mak- 
ing a copy and (2) a fingerprint is taken without the authorized user’s consent, for example, from a glass 
surface (a classic example of spy work) in a routine forensic procedure. 

Cappelli et al. [6,8] developed a commercially available synthetic fingerprint generator called SFinGe. 
In SFinGe, various models of fingerprint topologies are used: shape, directional map, density map, and 
skin deformation models. In Figure 32.2, two topological primitives are composed in various ways. 
These are examples of acceptable (valid) and unacceptable (invalid) synthesized fingerprints. 


FIGURE 32.2 Synthetic fingerprints generated by the SFinGe system: Invalid (a,c) and valid (b,d) topological 
compositions of fingerprint primitives. 


Kuecken [16] proposed an alternative method for synthetic fingerprint generation based on natural 
fingerprint formation, that is, the embryological process. Kuecken’s modeling approach started with an idea 
originating from Kollman (1883) that was promoted by Bonnevie in the 1920s. In Kuecken’s generator 
of synthetic fingerprints, the von Kármán equations are used to describe the mechanical behavior of a thin 
curved sheet of elastic material. 


32.2.2 Synthetic Iris and Retina Images 


Iris recognition systems scan the surface of the iris to analyze patterns. Retina recognition systems 
scan the surface of the retina and analyze nerve patterns, blood vessels, and such features. Automated 
methods of iris and retina image synthesis have not been developed yet, except for an approach based 
on generation of iris layer patterns [33]. 

A synthetic image can be created by combining segments of real images from a database. Various 
operators can be applied to deform or warp the original iris image: translation, rotation, rendering, etc. 
Various models of the iris, retina, and eye used to improve recognition can be found in [3,4,7,33]. 

An example of generating the posterior pigment epithelia of the iris using a Fourier transform on 
a random signal is considered below. A fragment of the FFT signal is interpreted as a gray-scale vector: 
the peaks in the FFT signal represent lighter shades and valleys represent darker shades. This procedure 
is repeated for other fragments as well. The data are plotted in 3D, a 2D slice of the data is taken, and a 
round image is generated from the slice using a polar transform. The superposition of the patterns of various 
iris layers forms a synthetic iris pattern. Synthetic collarette topology modeled by a randomly generated curve is 
shown in Figure 32.3. Figure 32.3b illustrates three different patterns obtained by this method. Other 
layer patterns can be generated based on wavelet, Fourier, polar, and distance transforms, as well as 
Voronoi diagrams [33]. 
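
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in [33]: the function name `synthetic_iris_layer`, the image size, the number of rings and angular samples, and the nearest-neighbor polar mapping are all choices made for the example. Each fragment of the FFT magnitude of a random signal becomes one ring of gray levels, and the resulting strip is wrapped into a round image.

```python
import numpy as np

def synthetic_iris_layer(size=128, n_rings=32, n_angles=180, seed=0):
    """Generate one round gray-scale layer pattern from a random signal."""
    rng = np.random.default_rng(seed)
    n = n_rings * n_angles
    signal = rng.standard_normal(2 * n)
    spectrum = np.abs(np.fft.rfft(signal))[1:n + 1]   # drop the DC term, keep n values

    # Each fragment of the spectrum is one gray-scale vector (one ring):
    # peaks map to lighter shades, valleys to darker shades.
    strip = spectrum.reshape(n_rings, n_angles)
    strip = (strip - strip.min()) / (strip.max() - strip.min())
    gray = (255 * strip).astype(np.uint8)

    # Polar transform: look up (radius, angle) of every pixel in the strip.
    yy, xx = np.mgrid[0:size, 0:size]
    c = (size - 1) / 2
    radius = np.hypot(xx - c, yy - c)
    theta = np.arctan2(yy - c, xx - c)                # in [-pi, pi]
    r_idx = (radius / (size / 2) * n_rings).astype(int)
    a_idx = ((theta + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    inside = r_idx < n_rings                          # pixels inside the disk

    image = np.zeros((size, size), dtype=np.uint8)
    image[inside] = gray[r_idx[inside], a_idx[inside]]
    return image
```

Calling the function with different seeds gives different layer patterns, and several such layers could then be superimposed to form a synthetic iris pattern.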


32.2.3 Synthetic Signatures 


The current interest in signature analysis and synthesis is motivated by the development of improved 
devices for human-computer interaction, which enable input of handwriting and signatures. The focus 
of this study is the formal modeling of this interaction [5,15,23]. 

To generate signatures with any automated technique, it is necessary to consider (a) the formal 
description of curve segments and their kinematical characteristics, (b) the set of requirements that 
should be met by any signature generation system, and (c) the possible scenarios for signature 
generation. The simplest method of generating synthetic signatures is based on a formal 2D geometrical 
description of the curve segments. Spline methods and Bezier curves are used for curve approximation, given 
some control points; manipulations of the control points give variations on a single curve in these 
methods [33]. A 3D on-line representation of a signature is given in Figure 32.4. 


FIGURE 32.3 Synthetic collarette topology modelled by a randomly generated curve: spectral representation (a) 
and three different synthetic patterns (b). 


FIGURE 32.4 3D view of an on-line signature: the plane curve is given by the two-tuple (X, Y), the pressure is 
associated with the Z axis, the speed of writing (that is, an additional dimension) is depicted by the shade of the 
curve, where darker is slower speed. (Courtesy of Prof. D. Popel, Baker University, USA.) 
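
The control-point manipulation mentioned above can be illustrated with cubic Bezier segments. The sketch below is a minimal example under stated assumptions: the function names `bezier` and `signature_variants`, the Gaussian jitter of the control points, and its scale are hypothetical choices and are not taken from [33]. Each stroke is one cubic segment, and every variant is produced by perturbing the control points before evaluating the curve.

```python
import numpy as np

def bezier(ctrl, n=50):
    """Evaluate a cubic Bezier curve at n points; ctrl is a (4, 2) array."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    basis = ((1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3)
    return sum(w * c for w, c in zip(basis, ctrl))    # (n, 2) curve points

def signature_variants(segments, n_variants=3, jitter=0.05, seed=0):
    """Produce variants of a signature given as a list of (4, 2) control arrays."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(n_variants):
        strokes = [bezier(seg + rng.normal(scale=jitter, size=seg.shape))
                   for seg in segments]
        variants.append(np.vstack(strokes))           # all strokes of one variant
    return variants
```

A larger `jitter` gives variants that differ more from the original curve, which is one simple way to control how far the synthetic samples stray from a reference signature.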


32.3 Example of the Application of Synthetic Biometric Data 


In this section, an application of synthetic biometrics for training users of a physical access con- 
trol system (PASS) is introduced. The main purpose of the PASS is the efficient support of security 
personnel enhanced with the situational awareness paradigm and intelligent tools. A registration 
procedure is common in the process of personal identification (checkpoints in homeland and airport 
security applications, hospitals, and other places where secure physical admission is practiced). We 
refer to the Defense Advanced Research Projects Agency (DARPA) research program HumanID, 
which is aimed at the detection, recognition, and identification of humans at a distance in an early 
warning support system for force protection and homeland defense [29]. The access authorization 
process is characterized by insufficiency of information. The result of a customer’s identification is 
a decision under uncertainty made by the user. Uncertainty (incompleteness, imprecision, contra- 
diction, vagueness, unreliability) is understood in the sense that available information allows for 
several possible interpretations, and it is not entirely certain which is the correct one. The user must 
make a decision under uncertainty, that is, select an alternative before any complete knowledge is 
obtained [31,32]. 


The architecture of the PASS is shown in Figure 32.5. The system consists of sensors such as cameras in 
the visible and infrared bands, decision-support assistants, a dialogue support device to support 
conversation based on the preliminary information obtained, and a personal file generating module. 
Three-level surveillance is used in the system: surveillance of the line (prescreening), surveillance 
during the walk between the prescreened and screened points, and surveillance during the authorization 
process at the officer’s desk (screening). 
FIGURE 32.5 The PASS is a semi-automatic, application-specific distributed computer system that aims to 
support the officer’s job in access authorization (left), and typical protocol of the authorization (right). 


Decision-support assistants. A device that gathers data from the sensors and performs intelligent data 
processing for situational awareness is called a decision-support assistant. The PASS is a distributed computing network 
of biometric-based assistants. For example, an assistant can be based on noninvasive metrics such as 
temperature measurement, artificial accessory detection, estimation of drug and alcohol intoxication, 
and estimation of blood pressure and pulse [9,25,35,36]. The decision support is built upon 
discriminative biometric analysis, which means detecting features used for evaluating the physiological and 
psychoemotional states of a person. Devices for various biometrics can be used as the kernels of 
decision-support assistants. Most assistants in the PASS are multipurpose devices, that is, they can be used for 
both person authorization and user training. 


The role of the biometric device. In the PASS, the role of each biometric device is twofold: its primary
function is to extract biometric data from an individual, and its secondary function is to support a
dialogue between the user and the customer. For example, if a high temperature is detected, the question
to this customer could be formulated as follows: “Do you need any medical assistance?” The key to the
concept of dialogue support in the PASS is the process of generating questions initiated by information
from biometric devices. In this way, the system assists the user in access authorization.
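
The question-generation step can be illustrated with a minimal Python sketch. It is not the PASS implementation: the finding name, the 38.0 °C threshold, and the functions `detect_findings` and `propose_questions` are assumptions introduced for illustration. Only the medical-assistance question is taken from the text.

```python
# Hypothetical mapping from biometric findings to dialogue prompts.
# Only this one question is quoted from the text; the structure is assumed.
QUESTION_BANK = {
    "high_temperature": "Do you need any medical assistance?",
}

FEVER_THRESHOLD_C = 38.0  # assumed threshold, not taken from the chapter


def detect_findings(measurements):
    """Turn raw measurements into symbolic findings."""
    findings = []
    temperature = measurements.get("temperature_c")
    if temperature is not None and temperature >= FEVER_THRESHOLD_C:
        findings.append("high_temperature")
    return findings


def propose_questions(measurements):
    """Return the questions the officer could ask, one per known finding."""
    return [
        QUESTION_BANK[finding]
        for finding in detect_findings(measurements)
        if finding in QUESTION_BANK
    ]


print(propose_questions({"temperature_c": 38.6}))
```

Keeping the findings symbolic separates the sensor-specific thresholds from the dialogue content, so a question bank could be revised without touching the measurement code.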


The time of service, T, can be divided into three phases: T_1 (the prescreening phase of service, or
waiting), T_2 (the individual’s movement from the prescreened position to the officer’s desk), and T_3
(the time of identification (document check) and authorization).


32.3.1 Hyperspectral Facial Analysis and Synthesis in Decision-Support Assistant


The concept of a multipurpose platform is applied to the decision-support assistants, including an assis- 
tant for hyperspectral face analysis. 


Hyperspectral face analysis. The main goal of face analysis in the infrared band is to detect physical
parameters such as temperature, blood flow rate, and blood pressure, as well as physiological changes
caused by alcohol and other substances. Another useful application of infrared image analysis is the
detection of artificial accessories such as artificial hair and plastic surgery. These data can provide
valuable information to support the interviewing of customers in systems like PASS.

Several models are implemented in the decision-support assistant for hyperspectral facial analysis and
synthesis, namely, a skin color model and a 3D hyperspectral face model.


Skin color models. In [30], a skin color model was developed and shown to be useful for the detection of
changes due to alcohol intoxication. The fluctuation of temperature in various facial regions is primarily
due to the changing blood flow rate. In [14], the heat-conduction formulas at the skin surface are
introduced. In [19], mass blind screening of potential SARS or bird flu patients was studied.

Human skin has a layered structure, and the skin color is determined by how incident light is
absorbed and scattered by the melanin and hemoglobin pigments in the two upper skin layers, the
epidermis and the dermis.

The color of human skin can reveal distinct characteristics valuable for diagnostics. The dominant 
pigments in skin color formation are melanin and hemoglobin. Melanin and hemoglobin determine the 
color of the skin by selectively absorbing certain wavelengths of the incident light. The melanin has a 
dark brown color and predominates in the epidermal layer while the hemoglobin has a reddish hue or 
purplish color, depending on the oxygenation, and is found mainly in the dermal layer. It is possible to 
obtain quantitative information about hemoglobin and melanin by fitting the parameters of an analyti- 
cal model with reflectance spectra. In [20], a method for visualizing local blood regions in the skin tissue 
using diffuse reflectance images was proposed. 
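
Fitting pigment parameters to a reflectance spectrum can be sketched as follows. This is a deliberately simplified illustration, not the analytical model of the cited works: it assumes that absorbance, A = -log10(R), is a linear combination of a melanin term and a hemoglobin term, and the absorption values used are made-up numbers, not measured spectra. The function `fit_pigments` is a hypothetical name.

```python
import math


def fit_pigments(reflectance, melanin_abs, hemoglobin_abs):
    """Least-squares fit of A = c_m * a_m + c_h * a_h, with A = -log10(R)."""
    absorbance = [-math.log10(r) for r in reflectance]
    # Entries of the 2x2 normal equations.
    s_mm = sum(m * m for m in melanin_abs)
    s_hh = sum(h * h for h in hemoglobin_abs)
    s_mh = sum(m * h for m, h in zip(melanin_abs, hemoglobin_abs))
    b_m = sum(m * a for m, a in zip(melanin_abs, absorbance))
    b_h = sum(h * a for h, a in zip(hemoglobin_abs, absorbance))
    det = s_mm * s_hh - s_mh * s_mh
    if abs(det) < 1e-12:
        raise ValueError("pigment spectra are not linearly independent")
    c_m = (b_m * s_hh - b_h * s_mh) / det
    c_h = (b_h * s_mm - b_m * s_mh) / det
    return c_m, c_h


# Made-up relative absorption values at four wavelengths (illustration only).
MELANIN = [1.0, 0.8, 0.6, 0.4]
HEMOGLOBIN = [0.2, 0.9, 0.5, 0.1]

# Synthetic reflectance generated from known contents (0.5, 0.3).
true_cm, true_ch = 0.5, 0.3
spectrum = [10 ** -(true_cm * m + true_ch * h) for m, h in zip(MELANIN, HEMOGLOBIN)]

c_m, c_h = fit_pigments(spectrum, MELANIN, HEMOGLOBIN)
print(round(c_m, 3), round(c_h, 3))
```

Applying such a fit pixel by pixel would yield one map per pigment; a practical system would need measured absorption spectra and a model of scattering in the skin layers.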

A quantitative analysis of human skin color and temperature distribution can reveal a wide range of
physiological phenomena. For example, skin color and temperature can change due to drug or alcohol
consumption, as well as physical exercise [2].


32.3.2 Hyperspectral Analysis-to-Synthesis 3D Face Model 


A decision-support assistant performs the hyperspectral face analysis based on a model that includes
two constituents: a face shape model (represented by a 3D geometric mesh) and a hyperspectral skin
texture model (generated from images in visible and infrared bands). The main advantage of 3D face
modeling is that the effect of variations in illumination, surface reflection, and shading from
directional light can be significantly decreased. For example, a 3D model can provide controlled
variations in appearance while the pose or illumination is changed. Also, facial expressions can be
estimated more accurately with 3D models than with 2D models.

A face shape is modeled by a polygonal mesh, while the skin is represented by texture map images in 
visible and infrared bands (Figure 32.6). Any individual face shape can be generated from the generic 
face model by specifying 3D displacements for each vertex. Synthetic face images are rendered by map- 
ping the texture image on the mesh model. 
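
The step of deriving an individual face shape from the generic model can be sketched in a few lines of Python. The three-vertex mesh and the displacement values are made up for illustration, and `individual_face` is a hypothetical helper; a real face mesh has many more vertices.

```python
def individual_face(generic_vertices, displacements):
    """Add one 3D displacement to each vertex of the generic mesh."""
    if len(generic_vertices) != len(displacements):
        raise ValueError("one displacement is required per vertex")
    return [
        (x + dx, y + dy, z + dz)
        for (x, y, z), (dx, dy, dz) in zip(generic_vertices, displacements)
    ]


# Toy "mesh"; all coordinates are illustrative assumptions.
generic = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
offsets = [(0.0, 0.0, 0.1), (0.2, 0.0, 0.0), (0.0, -0.1, 0.0)]
triangles = [(0, 1, 2)]  # connectivity is shared by every individual face

print(individual_face(generic, offsets))
```

Because only vertex positions change, the polygon connectivity and the texture coordinates of the generic model can be reused for every individual.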

Face images in visible and infrared bands, acquired by the sensors, constitute the input of the module 
for hyperspectral face analysis and synthesis. The corresponding 3D models, one for video and one for 
infrared images, are generated by fitting the generic model to images (Figure 32.6). 

The texture maps represent the hemoglobin and melanin content and the temperature distribution of 
the facial skin. These maps are the output of the face analysis and modeling module. This information is 
used for evaluating the physical and psychoemotional state of a person. 

FIGURE 32.6 The generic 3D polygonal mesh, skin texture, and the resulting 3D rendered model (top). Face
images in visible and infrared bands and their 3D models (bottom).


Facial action analysis. Face models are considered to convey emotions. In systems like PASS, emotions
play the role of “indicators” used for decision-making about the emotional and physiological state
of the customer and for generating questions for further dialogue. Visual band images along with thermal
(infrared) images can be used in this task [21,22,27]. Facial expressions are formed by about 50 facial
muscles [12] and are controlled by dozens of parameters in the model (Figure 32.7). The facial expression
can be identified once the facial action units are recognized. This task involves extracting facial
features (eyes, eyebrows, nose, lips, chin lines), measuring geometric distances between the extracted
points and lines, and then recognizing facial action units based on these measurements. Decision-making
is based on the analysis of changes in facial expression while a person listens and responds to questions.
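
The measurement-based recognition of a single action unit can be sketched as follows. The landmark names, the coordinates, and the 15% threshold are assumptions chosen for illustration; the sketch covers only one action unit and is not the recognition method of the cited works.

```python
import math


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def brow_raise_ratio(landmarks):
    """Brow-to-eye distance normalized by the inter-ocular distance."""
    scale = distance(landmarks["left_eye"], landmarks["right_eye"])
    return distance(landmarks["left_inner_brow"], landmarks["left_eye"]) / scale


def detect_inner_brow_raiser(neutral, current, threshold=1.15):
    """Flag the action unit when the normalized distance grows by 15% or more."""
    return brow_raise_ratio(current) >= threshold * brow_raise_ratio(neutral)


# Hypothetical landmark positions in pixel coordinates.
neutral = {"left_eye": (30, 50), "right_eye": (70, 50), "left_inner_brow": (35, 40)}
raised = {"left_eye": (30, 50), "right_eye": (70, 50), "left_inner_brow": (35, 36)}

print(detect_inner_brow_raiser(neutral, raised))
```

Normalizing by the inter-ocular distance makes the measurement independent of how far the person stands from the camera.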


Facial action unit    Muscular basis
Inner brow raiser     Frontalis, pars medialis
Outer brow raiser     Frontalis, pars lateralis
Upper lid raiser      Levator palpebrae superioris
Cheek raiser          Orbicularis oculi, pars palpebralis
Lip corner puller     Zygomatic major
Cheek puffer          Caninus
Chin raiser           Mentalis
Lip stretcher         Risorius
Lip funneler          Orbicularis oris
Lip tightener         Orbicularis oris
Mouth stretch         Pterygoid, digastric
Lip suck              Orbicularis oris
Nostril dilator       Nasalis, pars alaris
Slit                  Orbicularis oculi


FIGURE 32.7 A 3D facial mesh model (a) and fragment of corresponding facial action units (b). 

FIGURE 32.8 A setup of a pair of video and infrared cameras for surveillance (a) and experimental equipment
for 3D hyperspectral face modeling (b).


Hyperspectral data acquisition. A setup of paired video and thermal cameras for the acquisition of facial
images in both visible and infrared bands is shown in Figure 32.8. Both cameras can acquire
full-resolution images. Infrared facial images are provided by an uncooled microbolometer infrared camera.
The network of various assistants is based on a PC station with acquisition boards.


32.4 Synthetic Data for User Training in Biometric Systems 


The basic concept of the PASS is the collaboration of the user, the customer, and the machine. This is a
dialogue-based interaction. Based on the premise that the user has priority in decision-making at the
highest level of the system hierarchy, the role of the machine is defined as assistance to, or support
of, the user.

The training methodology should be short-term, periodically repeated, and intensive. The PASS can
be used as a training system (with minimal extension of tools) without changing the place of
deployment. In this way, we fulfill the criterion of cost efficiency and satisfy the above requirements.

Simulation of extreme scenarios is aimed at developing the particular skills of the personnel. The 
modeling of extreme situations requires developing specific training methodologies and techniques, 
including virtual environments. 


Scenarios of decision-making support. The possible scenarios are divided into three groups: regular,
nonstandard, and extreme. Let us consider an example of a scenario in which the system generates the
following data about the screened person.


Protocol for person #45 under pre-screening    Protocol for person #45 under screening
Time: 12.00.00                                 Time: 12.10.20
Warning: level 04                              Warning: level 04
Specification: Drug or alcohol                 Specification: Drug or alcohol
  intoxication, level 03                         intoxication, level 03
Possible action:                               Local database matching: positive
  1. Database inquiry                          Possible action:
  2. Clarify in the dialogue                     1. Further inquiry using dialogue
                                                 2. Direct to the special inspection

FIGURE 32.9 Scenarios for user training: protocol of pre-screening (left) and screening (right). 


Protocol of the person #45 under screening          Protocol of the person #45 under screening (continuation)
Time: 00.00.00                                      Level of trustworthiness of Question 1 is 02
Warning: level 04                                   Level of trustworthiness of Question 2 is 02
Specification: Drug or alcohol                      Level of trustworthiness of Question 3 is 03
  consumption, level 03                             Level of trustworthiness of Question 4 is 00
Local database matching: positive                   Level of trustworthiness of Question 5 is 03
Proposed dialogue questions:                        Level of trustworthiness of Question 6 is 03
Question 1: Do you need any medical assistance?     Possible action:
Question 2: Any service problems during the flight?   1. Direct to special inspection
Question 3: Do you plan to rent a car?                2. Further inquiry using dialogue
Question 4: Did you meet friends on board?
Question 5: Did you consume wine or whisky aboard?
Question 6: Do you have drugs in your luggage?


FIGURE 32.10 Protocol of the person during screening: the question generation (left) and their analysis (right) 
with corresponding level of trustworthiness. 


According to the protocol shown in Figure 32.9 (left), the system estimates the third level of warning
using automatically measured drug or alcohol intoxication for the screened customer. A knowledge-based
subsystem evaluates the risks and generates two possible solutions. The user can, in addition to the
automated analysis, evaluate the images acquired in the visible and infrared spectra.
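
A protocol record and the rule that proposes actions to the user can be sketched as follows. The field names, the `ALERT_LEVEL` threshold, and the rule base in `possible_actions` are hypothetical; only the field values and the action texts follow the protocols in Figure 32.9.

```python
from dataclasses import dataclass
from typing import List, Optional

ALERT_LEVEL = 3  # assumed: warnings at or above this level trigger proposals


@dataclass
class Protocol:
    person_id: int
    phase: str                      # "pre-screening" or "screening"
    warning_level: int
    specification: str
    specification_level: int
    database_match: Optional[bool] = None  # known only after the inquiry


def possible_actions(protocol: Protocol) -> List[str]:
    """Hypothetical rule base returning the options offered to the officer."""
    if protocol.warning_level < ALERT_LEVEL:
        return []
    if protocol.phase == "pre-screening":
        return ["Database inquiry", "Clarify in the dialogue"]
    if protocol.phase == "screening":
        return ["Further inquiry using dialogue", "Direct to the special inspection"]
    raise ValueError(f"unknown phase: {protocol.phase}")


pre = Protocol(45, "pre-screening", 4, "Drug or alcohol intoxication", 3)
print(possible_actions(pre))
```

The function only proposes options; in keeping with the premise that the user has priority, the final choice between them stays with the officer.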

The example in Figure 32.10 (left) introduces a scenario based on the analysis of behavioral biometric 
data. The results of the automated analysis of behavioral information are presented to the user (Figure 
32.10, right). Let us assume that there are three classes of samples assigned to “Disability,” “Alcohol intoxi- 
cation,” and “Normal.” The following linguistic constructions can be generated by the system: Not enough 
data, but abnormality is detected, or Possible alcohol intoxication, or An individual with a disability. 
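
The generation of such linguistic constructions from class scores can be sketched as follows. The scores, the 0.6 confidence threshold, and the decision rule are assumptions for illustration; only the class names and the three phrases come from the text.

```python
# Phrases quoted from the text, keyed by the class they describe.
MESSAGES = {
    "Alcohol intoxication": "Possible alcohol intoxication",
    "Disability": "An individual with a disability",
}
NOT_ENOUGH_DATA = "Not enough data, but abnormality is detected"


def linguistic_construction(scores, min_confidence=0.6):
    """Map class scores to a phrase for the user (None means nothing to report)."""
    best = max(scores, key=scores.get)
    if best == "Normal":
        return None
    if scores[best] < min_confidence:
        return NOT_ENOUGH_DATA
    return MESSAGES[best]


# Hypothetical classifier output for one person.
scores = {"Disability": 0.1, "Alcohol intoxication": 0.7, "Normal": 0.2}
print(linguistic_construction(scores))
```

Returning a phrase rather than a raw score keeps the output of each assistant in the semantic form that the user can act on directly.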

The user must be able to communicate effectively with the customer in order to minimize uncertainty.
Limited information will be obtained if the customer does not respond to inquiries or if his/her
answers are not understood. We distinguish two types of uncertainty about the customer: the uncertainty
that can be minimized by using customer responses, his/her documents, and information from
databases; and the uncertainty of appearance (physiological and behavioral) information such as specific
features in the infrared facial image, gait, and voice. In particular, changes in facial appearance
relative to the document photos can be modeled using software that models aging. The uncertainty of
appearance can be minimized by specifically oriented questionnaire techniques. These techniques have
been used in criminology, in particular, for interviewing and interrogation. The output of each personal
assistant is represented in semantic form. The objective of each semantic construction is the
minimization of uncertainty, that is, (a) choosing an appropriate set of questions (expert support) from
the database, (b) alleviating the errors and temporal faults of biometric devices, and (c) maximizing the
correlation between various biometrics.

Deception can be defined as a semantic attack that is directed against the decision-making process.
Technologies for preventing, detecting, and prosecuting semantic attacks are still in their infancy. Some
techniques of forensic interviewing and interrogation formalism with elements of detecting the semantic
attack are useful in dialogue development. In particular, in a training system, modeling is replaced by
real-world conditions, and long-term training is replaced by periodically repeated short-term intensive
computer-aided training.


The PASS extension for user training. In PASS, an expensive training system is replaced by an
inexpensive extension of the PASS, already deployed at the place of application. In this way, an
important effect is achieved: complicated and expensive modeling is replaced with real-world conditions,
except for some particular cases considered in this chapter. Furthermore, long-term training is replaced
by periodically repeated short-term intensive computer-aided training. The PASS and T-PASS implement
the concept of multi-target platforms, that is, the PASS can be easily reconfigured into the
T-PASS and vice versa.


32.5 Other Applications 


Simulators of biometric data are emerging technologies for educational and training purposes (immi- 
gration control, banking service, police, justice, etc.). They emphasize decision-making skills in non- 
standard and extreme situations. 


Databases for synthetic biometric data. Imitation of biometric data allows the creation of databases with
tailored biometric data without expensive studies involving human subjects. An example of a tool used
to create databases for fingerprints is the SFinGe system [6]. The generated databases were included in
the Fingerprint Verification Competition FVC2004 and perform just as well as real fingerprints.


Synthetic speech and singing voices. A synthetic voice should carry information about age, gender, emo- 
tion, personality, physical fitness, and social upbringing [10]. A closely related but more complicated 
problem is generating a synthetic singing voice for the training of singers, by studying famous singers’ 
styles and designing synthetic user-defined styles combining voice with synthetic music. An example 
of a direct biometric problem is identifying speech, given a video fragment without recorded voice. The 
inverse problem is mimicry synthesis (animation) given a text to be spoken (synthetic narrator). 


Cancelable biometrics. The issue of protecting privacy in biometric systems has inspired the research
direction referred to as cancelable biometrics [4]. Cancelable biometrics is aimed at enhancing the
security and privacy of biometric authentication through the generation of “deformed” biometric data,
that is, synthetic biometrics. Instead of using a true object (finger, face), the fingerprint or face
image is intentionally distorted in a repeatable manner, and this new print or image is used.
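
A repeatable, key-dependent distortion can be sketched as follows. This is an illustrative transform (a keyed permutation with sign flips applied to a feature vector), not the scheme of [4]; the feature values, the key, and the function name `cancelable_template` are assumptions.

```python
import random


def cancelable_template(features, key):
    """Distort a feature vector repeatably: keyed permutation plus sign flips."""
    rng = random.Random(key)  # the same key always yields the same distortion
    order = list(range(len(features)))
    rng.shuffle(order)
    signs = [rng.choice((-1, 1)) for _ in features]
    return [signs[i] * features[j] for i, j in enumerate(order)]


# Hypothetical feature vector extracted from a fingerprint or face image.
features = [0.12, 0.80, 0.33, 0.57, 0.91, 0.05, 0.44, 0.68]

enrolled = cancelable_template(features, key=2024)
probe = cancelable_template(features, key=2024)
print(enrolled == probe)  # the distortion is repeatable for a fixed key
```

If a stored template is compromised, it can be cancelled by issuing a new key, which with overwhelming probability produces a different template from the same biometric. A deployed scheme would also need the transform to be hard to invert, which this sketch does not attempt.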


Caricature is the art of making a drawing of a face, which makes part of its appearance more noticeable 
than it really is, and which can make a person look ridiculous. Specifically, a caricature is a synthetic 
facial expression, in which the distances of some feature points from the corresponding positions in the 
normal face have been exaggerated. The reason why the art-style of the caricaturist is of interest for image 
analysis, synthesis, and especially for facial expression recognition and synthesis is as follows [13]. Facial 
caricatures incorporate the most important facial features and a significant set of distorted features. 
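
The exaggeration of feature-point distances can be written as p' = n + k(p - n), where n is a feature point of the normal face, p is the corresponding point of the individual face, and k > 1 is the exaggeration factor. The following minimal sketch applies this rule; the point coordinates and the factor 1.5 are made up for illustration, and `exaggerate` is a hypothetical name.

```python
def exaggerate(face_points, normal_points, factor=1.5):
    """Push each feature point away from its position in the normal face."""
    if len(face_points) != len(normal_points):
        raise ValueError("point lists must have the same length")
    return [
        (nx + factor * (px - nx), ny + factor * (py - ny))
        for (px, py), (nx, ny) in zip(face_points, normal_points)
    ]


# Hypothetical 2D feature points: the second point already matches the norm.
normal = [(1.0, 1.0), (3.0, 1.0)]
face = [(2.0, 1.0), (3.0, 1.0)]

print(exaggerate(face, normal))
```

Points that coincide with the normal face are left unchanged, so only the distinctive features of the individual are amplified.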


Lie detectors. Synthetic biometric data can be used in the development of a new generation of lie
detectors [12,24,33]. For example, behavioral biometric information is useful in evaluating the
truthfulness of answers to questions, or in evaluating the honesty of a person in the process of
speaking [12]. Emotions contribute additionally to the temperature distribution in the infrared facial
image.

Humanoid robots are artificial intelligence machines whose design demands the resolution of certain
direct and inverse biometric problems, such as language technologies; recognition of the “mood” of the
instructor by means of facial expressions and gestures; following of cues; dialogue and logical
reasoning; and vision, hearing, olfaction, touch, and other senses [26].


Ethical and social aspects of synthetic biometrics. Particular examples of the negative impact of synthetic bio- 
metrics are as follows: (a) Synthetic biometric information can be used not only for improving the character- 
istics of biometric devices and systems, but also by forgers to discover new strategies of attack. (b) Synthetic 
biometric information can be used for generating multiple copies of original biometric information. 